WEBVTT
00:00:00.030 --> 00:00:03.870
The regression analysis basically
draws a line through the data.
00:00:03.870 --> 00:00:08.130
And the line is defined by regression
coefficients, or the betas, in the model.
00:00:08.640 --> 00:00:13.710
Our task now is to figure out
how we estimate those betas.
00:00:13.710 --> 00:00:18.720
So we give regression analysis some data
of the dependent variable, and one or
00:00:18.720 --> 00:00:25.410
more independent variables, and the regression
analysis tells us where the best line goes.
00:00:25.410 --> 00:00:29.760
The line is defined by the betas.
How does the analysis know which line is the best?
00:00:29.760 --> 00:00:33.060
To answer that question we'll
be looking at some example data.
00:00:33.060 --> 00:00:37.710
This same data set is used in
one of the assignments and it
00:00:37.710 --> 00:00:40.800
comes from the census of Canada from the early 70s.
00:00:40.800 --> 00:00:44.340
The data set is called Prestige, and
00:00:44.340 --> 00:00:48.210
the observations here are
occupations, 102 of them.
00:00:48.210 --> 00:00:51.450
And we have data for the education which
00:00:51.450 --> 00:00:58.890
is the mean number of years of education
that people in that occupation have.
00:00:58.890 --> 00:01:01.200
What's the average income for that occupation?
00:01:02.580 --> 00:01:07.590
What percentage of women there are,
from 0 to 100%, in that occupation?
00:01:07.590 --> 00:01:13.080
What is the prestige of the occupation, a prestige score
00:01:13.080 --> 00:01:15.840
that is defined in some way that
we don't really care about?
00:01:15.840 --> 00:01:19.260
We have the census code, which is an
identifier that we don't need.
00:01:19.260 --> 00:01:23.910
And we have type which is a categorical
variable that can be either white collar,
00:01:23.910 --> 00:01:25.320
blue collar, or professional.
00:01:25.320 --> 00:01:28.950
Then there's some information about
where the data comes from and this
00:01:28.950 --> 00:01:34.740
is a printout from the documentation of the
R package car, which contains this dataset.
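The structure just described can be sketched as a record. This is only an illustration: the field names follow the car package's documentation, but the values below are invented, not the real census figures.

```python
# One illustrative record with the structure of the Prestige data set.
# Field names follow the car package's documentation; the values are
# made up for illustration, not real census figures.
occupation_record = {
    "education": 13.1,  # average years of education in the occupation
    "income": 12000,    # average income of incumbents
    "women": 11.2,      # percentage of women, 0 to 100
    "prestige": 68.8,   # prestige score
    "census": 1113,     # census occupation code (identifier, not needed)
    "type": "prof",     # "bc" (blue collar), "wc" (white collar), "prof"
}
print(sorted(occupation_record))
# ['census', 'education', 'income', 'prestige', 'type', 'women']
```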
00:01:34.740 --> 00:01:43.620
So we will be doing a regression estimation and
our task is to explain prestige with education.
00:01:43.620 --> 00:01:48.810
How much prestige an occupation
has depends on the amount
00:01:48.810 --> 00:01:52.800
of education, in years, that
is required for that occupation.
00:01:53.790 --> 00:01:59.340
In our regression model, we said that
prestige is a weighted sum of beta 0, the
00:01:59.340 --> 00:02:04.110
intercept or the base level for
occupations requiring no education,
00:02:04.110 --> 00:02:11.010
plus beta 1, which is the effect of education,
plus some variation that the model
00:02:11.010 --> 00:02:17.520
doesn't explain, the error term u.
Our task is to estimate beta 0 and
00:02:17.520 --> 00:02:23.970
beta 1, which define the regression line.
And estimates in statistical analysis
00:02:23.970 --> 00:02:30.360
are usually denoted by drawing this kind
of caret or hat symbol over the beta.
00:02:30.360 --> 00:02:34.260
So this is beta hat 0, this is beta hat 1.
00:02:34.260 --> 00:02:38.250
They are estimates of this
population regression model.
00:02:38.250 --> 00:02:44.280
So the hat denotes that we don't know the true value
but have calculated a value from a sample.
00:02:44.280 --> 00:02:48.630
And that serves as an estimate for what
is the relationship in the population.
00:02:48.630 --> 00:02:56.100
Now we need to have a rule to set the line.
We have drawn the regression line here and the
00:02:56.100 --> 00:02:58.980
regression line should go through
the middle of the observations.
00:02:58.980 --> 00:03:03.930
So that there are about the same amount of
observations above the line and below the line.
00:03:03.930 --> 00:03:08.280
And the observations also are assumed to
be normally distributed around the line.
00:03:08.280 --> 00:03:14.820
So that most observations are clustered close
to the line, and some are further from the line.
00:03:14.820 --> 00:03:19.740
Telling a person to draw a
line in the middle is easy,
00:03:19.740 --> 00:03:21.600
and the person will probably draw a line like that.
00:03:21.600 --> 00:03:25.230
But you can't tell a computer to draw a line
00:03:25.230 --> 00:03:27.840
in the middle, because "in the
middle" is not well-defined.
00:03:28.410 --> 00:03:32.370
You have to have a specific
rule on how to draw the line.
00:03:32.370 --> 00:03:36.630
And that specific rule of
estimation is called an estimator.
00:03:36.630 --> 00:03:43.080
So an estimator is any rule, strategy, or
principle that we can apply to calculate
00:03:43.080 --> 00:03:48.390
values for the quantities of
interest from sample data.
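A simpler example than regression may make the idea concrete. A minimal Python sketch, using the familiar sample mean as a rule for estimating a population mean (the data here are invented):

```python
# An estimator is a rule applied to sample data. The sample mean is a
# simple example: a rule for estimating the population mean.
def sample_mean(sample):
    return sum(sample) / len(sample)

# Apply the rule to some hypothetical sample data.
print(sample_mean([4.0, 6.0, 5.0, 7.0]))  # 5.5
```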
00:03:50.610 --> 00:03:54.810
Let's take a look at what are some
properties of good estimators.
00:03:54.810 --> 00:03:58.560
We covered this in another
video before but let's revise.
00:03:58.560 --> 00:04:02.520
So we need to have estimators that are consistent.
00:04:02.520 --> 00:04:10.620
Consistency means that when we have the full
population data, our estimates beta hat 0
00:04:10.620 --> 00:04:18.870
and beta hat 1 are equal to beta 0 and beta 1.
So these estimates will be the population values.
00:04:18.870 --> 00:04:22.710
In other words, if we have the
full data, a consistent estimator
00:04:22.710 --> 00:04:26.040
gives us the correct result for that population.
00:04:26.040 --> 00:04:32.910
Then we have another important
property, which is unbiasedness.
00:04:32.910 --> 00:04:37.530
Unbiasedness means that even if we don't
have the full data set but only a sample,
00:04:37.530 --> 00:04:43.440
our estimates will be correct on average
if we repeat the study over and over.
00:04:43.440 --> 00:04:46.890
That the estimates are correct
on average is unbiasedness.
00:04:46.890 --> 00:04:50.820
Then we have efficiency which
means that the estimates are
00:04:50.820 --> 00:04:56.460
more precise or more accurate than
any possible alternative estimator.
00:04:56.460 --> 00:05:03.150
So efficiency is a property that we can use
to compare two estimators that are unbiased.
00:05:03.150 --> 00:05:07.680
Finally, the repeated estimates from repeated
00:05:07.680 --> 00:05:12.390
samples should be normally distributed
or at least follow a known distribution.
00:05:12.390 --> 00:05:17.670
And that is important for statistical
inference or calculating the p-values.
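The "correct on average over repeated samples" idea can be illustrated with a small simulation. A Python sketch, again using the sample mean as the estimator; the population value of 10.0 and all other numbers are invented:

```python
import random

# Simulate repeating a study many times: draw a sample, apply the
# estimator (here the sample mean), and record the estimate.
random.seed(0)
TRUE_MEAN = 10.0  # hypothetical population value

estimates = []
for _ in range(2000):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(30)]
    estimates.append(sum(sample) / len(sample))

# An unbiased estimator is right on average across the repetitions,
# even though any single estimate misses the true value.
average_estimate = sum(estimates) / len(estimates)
print(round(average_estimate, 1))  # 10.0
```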
00:05:17.670 --> 00:05:24.900
One really good rule for
estimating the regression model,
00:05:24.900 --> 00:05:29.550
actually the best rule, is to use the residuals.
00:05:29.550 --> 00:05:34.170
And when we have a regression
line here, we can see that the
00:05:34.170 --> 00:05:40.950
observations are not exactly on the line.
Instead they are somewhere around the line.
00:05:40.950 --> 00:05:43.560
And the line is the prediction.
00:05:43.560 --> 00:05:49.110
So this is the amount of income that would
be predicted based on your education level.
00:05:49.110 --> 00:05:53.040
And then the difference between the actual income
00:05:53.040 --> 00:05:57.690
and the predicted income by the
model is called the residual.
00:05:57.690 --> 00:06:02.190
So that is the part of the dependent
variable that the model doesn't explain.
00:06:02.190 --> 00:06:12.090
We can calculate this regression line by plugging
in our estimates: beta hat 0 plus beta hat 1 times education.
00:06:12.090 --> 00:06:17.010
That gives the line, and
then something remains.
00:06:17.010 --> 00:06:21.300
The difference between the line
and an observation is the residual.
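Observed minus predicted is all a residual is. A minimal Python sketch with invented numbers for a candidate line:

```python
# A candidate regression line: prestige_hat = b0 + b1 * education.
# All numbers here are hypothetical, for illustration only.
b0, b1 = 10.0, 4.0
education = [8.0, 12.0, 16.0]
prestige = [40.0, 55.0, 80.0]

# Residual = observed value minus the value the line predicts.
predicted = [b0 + b1 * x for x in education]
residuals = [y - y_hat for y, y_hat in zip(prestige, predicted)]
print(residuals)  # [-2.0, -3.0, 6.0]
```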
00:06:21.300 --> 00:06:30.720
So the best rule for estimating this
regression model is to place the line so
00:06:30.720 --> 00:06:36.900
that the sum of these residuals raised to
the second power is as small as possible.
00:06:36.900 --> 00:06:43.950
So how do we do it in practice?
We set the line somewhere, we calculate residuals
00:06:43.950 --> 00:06:51.030
for each observation, we raise each residual
to the second power, we take a sum and then we
00:06:51.030 --> 00:06:57.240
try different values for the betas to make the
sum of squared residuals as small as possible.
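The steps just described can be sketched literally. A Python sketch with invented data: square each residual, sum them, and try candidate betas on a crude grid, keeping the pair with the smallest sum of squared residuals.

```python
# Hypothetical data: education (x) and prestige (y) for five occupations.
education = [8.0, 10.0, 12.0, 14.0, 16.0]
prestige = [35.0, 42.0, 51.0, 60.0, 67.0]

def sum_of_squared_residuals(b0, b1):
    # For a candidate line, square each residual and sum them up.
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(education, prestige))

# Try many candidate values for the betas (a crude grid search, step 0.1)
# and keep the pair that makes the sum of squared residuals smallest.
candidates = ((i / 10, j / 10) for i in range(-100, 101) for j in range(0, 101))
best_b0, best_b1 = min(candidates, key=lambda b: sum_of_squared_residuals(*b))
print(best_b0, best_b1)  # 1.8 4.1
```

In practice no search is needed (the minimum has a formula), but this sketch shows the principle being minimized.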
00:06:57.240 --> 00:07:01.500
This is called the ordinary
least-squares estimator and
00:07:01.500 --> 00:07:04.560
it has been proven to be consistent, unbiased,
00:07:04.560 --> 00:07:08.400
and efficient, and it produces
normally distributed estimates
00:07:08.400 --> 00:07:11.130
under certain assumptions that we will cover later.
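For simple regression the least-squares minimization has a well-known closed form: beta hat 1 equals the sum of cross-deviations divided by the sum of squared x-deviations, and beta hat 0 equals the mean of y minus beta hat 1 times the mean of x. A Python sketch with invented data:

```python
# Hypothetical data, for illustration only.
education = [8.0, 10.0, 12.0, 14.0, 16.0]
prestige = [35.0, 42.0, 51.0, 60.0, 67.0]

n = len(education)
x_bar = sum(education) / n
y_bar = sum(prestige) / n

# Closed-form OLS for simple regression:
#   beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   beta0_hat = y_bar - beta1_hat * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, prestige))
s_xx = sum((x - x_bar) ** 2 for x in education)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
print(round(beta0_hat, 1), round(beta1_hat, 1))  # 1.8 4.1
```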