The regression analysis basically draws a line through the data, and the line is defined by the regression coefficients, the betas in the model. Our task now is to figure out how we estimate those betas. So we give the regression analysis some data on the dependent variable and one or more independent variables, and the regression analysis tells us where the best line goes. The line is defined by the betas. How does the analysis know which line is the best?

To answer that question, we'll be looking at some example data. This same data set is used in one of the assignments, and it comes from the census of Canada from the early 70s. The data set is called Prestige, and the observations here are occupations, 102 of them. We have data for education, which is the mean number of years of education that incumbents of that occupation hold; the average income for that occupation; how many women there are, from 0 to 100%, in that occupation; and the prestige of the occupation on a prestige score that is defined in some way that we don't really care about. We also have a census code, which is an identifier that we don't need, and type, which is a categorical variable that can be either white collar, blue collar, or professional.
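As a rough sketch of the structure just described, here is what rows of such a data set look like. The variable names follow the Prestige documentation, but the occupation names and all numbers below are made up for illustration, not the real census values:

```python
# Hypothetical rows shaped like the Prestige data set: one row per
# occupation, with education (mean years), income (average dollars),
# women (percentage, 0-100), prestige (a prestige score), census
# (identifier), and type (wc = white collar, bc = blue collar,
# prof = professional). All values here are invented placeholders.
prestige_sample = [
    # occupation, education, income, women, prestige, census, type
    ("clerk.a",        12.0,   4000,  80.0,     40.0,   1111, "wc"),
    ("trade.b",         8.0,   6000,   5.0,     35.0,   2222, "bc"),
    ("prof.c",         16.0,   9000,  30.0,     70.0,   3333, "prof"),
]

for occ, edu, inc, women, pres, census, typ in prestige_sample:
    print(f"{occ:8s} education={edu:5.1f} prestige={pres:5.1f} type={typ}")
```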
Then there's some information about where the data comes from; this is a printout from the documentation of the R package car, which contains this data set. So we will be doing a regression estimation, and our task is to explain prestige with education: how much does the prestige of an occupation depend on the amount of education, in years, that is required for that occupation?

In our regression model, we said that prestige is a weighted sum of beta 0, the intercept, or the base level for an occupation requiring no education, plus beta 1, which is the effect of education, plus some variation that the model doesn't explain, the error term u. Our task is to estimate beta 0 and beta 1, which define the regression line. Estimates in statistical analysis are usually denoted by drawing this kind of caret or hat symbol over the beta. So this is beta hat 0, and this is beta hat 1; they are estimates of this population regression model. The hat denotes that we don't know a value, but we have calculated the value from a sample, and that serves as an estimate of what the relationship is in the population.

Now we need to have a rule to set the line. We have drawn the regression line here, and the regression line should go through the middle of the observations, so that there are about the same number of observations above the line and below the line.
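Written out (using i to index the occupations), the population model and the fitted line it is estimated by are:

```latex
% Population regression model: prestige as a linear function of education,
% with intercept beta_0, slope beta_1, and unexplained error u_i
\mathrm{prestige}_i = \beta_0 + \beta_1 \,\mathrm{education}_i + u_i

% Fitted regression line: hats mark quantities estimated from the sample
\widehat{\mathrm{prestige}}_i = \hat{\beta}_0 + \hat{\beta}_1 \,\mathrm{education}_i
```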
The observations are also assumed to be normally distributed around the line, so that most observations are clustered closer to the line and some are further from the line. Telling a person to draw a line through the middle is easy, and the person will probably draw a line like that. But you can't tell a computer to draw a line through the middle, because "through the middle" is not well defined. You have to have a specific rule for how to draw the line, and that specific rule of estimation is called an estimator. So an estimator is any rule, strategy, or principle that we can apply to calculate values for the quantities of interest from sample data.

Let's take a look at some properties of good estimators. We covered this in another video before, but let's revise. First, we need estimates that are consistent. Consistency means that when we have the full population data, our estimates beta hat 0 and beta hat 1 are equal to beta 0 and beta 1; the estimates will be the population values. In other words, if we have the full data, a consistent estimator gives us the correct result for that population. Unbiasedness is another important property. It means that even if we don't have the full data set, or a large sample, our estimates will be correct on average if we repeat the study over and over.
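A small simulation (an assumed setup, not from the lecture) illustrates what "correct on average" means: we draw many samples from a model whose true slope we chose ourselves, estimate the slope in each sample, and check that the average of the estimates lands on the true value.

```python
import random

random.seed(0)

TRUE_B0, TRUE_B1 = 2.0, 0.5  # hypothetical population coefficients

def ols_slope(xs, ys):
    """Least-squares slope estimate: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Repeat the "study" many times: new sample, new estimate each time.
estimates = []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(30)]
    ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0, 1) for x in xs]
    estimates.append(ols_slope(xs, ys))

mean_b1 = sum(estimates) / len(estimates)
print(round(mean_b1, 2))  # close to the true slope 0.5 on average
```

Any single estimate misses the true slope a little, but the mean over repeated samples does not drift away from it; that absence of systematic drift is unbiasedness.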
So the estimates being correct on average is unbiasedness. Then we have efficiency, which means that the estimates are more precise, or more accurate, than those of any possible alternative estimator. Efficiency is a property that we can use to compare two estimators that are both unbiased. Finally, repeated estimates from repeated samples should be normally distributed, or at least follow a known distribution. That is important for statistical inference, or calculating the p-values.

One really good rule for estimating the regression model, actually the best rule, is to use the residuals. When we have a regression line here, we can see that the observations are not exactly on the line; instead they are somewhere around the line. The line is the prediction: this is the amount of income that would be predicted based on your education level. The difference between the actual income and the income predicted by the model is called the residual. So that is the part of the dependent variable that the model doesn't explain. We can calculate this regression line by plugging in our estimates, beta hat 0 plus beta hat 1 times education. That gives the line, and then whatever remains, the difference between the line and the observation, is the residual.
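The residual is simply the observed value minus the fitted value on the line. A minimal sketch, using hypothetical coefficient estimates and made-up data points:

```python
# Hypothetical fitted coefficients for a prestige ~ education line
# (illustrative values, not estimates from the real data).
b0_hat, b1_hat = -10.0, 5.4

education = [8.0, 11.0, 15.0]   # observed x values
prestige  = [35.0, 42.0, 73.0]  # observed y values

# Fitted value = point on the line; residual = observed - fitted.
fitted    = [b0_hat + b1_hat * x for x in education]
residuals = [y - yhat for y, yhat in zip(prestige, fitted)]

for x, y, yhat, e in zip(education, prestige, fitted, residuals):
    print(f"education={x:4.1f} observed={y:5.1f} "
          f"fitted={yhat:5.1f} residual={e:+6.1f}")
```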
So the best rule for estimating this regression model is to find the line so that the sum of these residuals raised to the second power is as small as possible. How do we do it in practice? We set the line somewhere, we calculate the residual for each observation, we raise each residual to the second power, we take the sum, and then we try different values for the betas to make the sum of squared residuals as small as possible. This is called the ordinary least squares estimator, and it has been proven to be consistent, unbiased, and efficient, and it produces normally distributed estimates under assumptions that we will cover later.
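In practice nobody searches over beta values by trial and error: for this simple model the minimization of the sum of squared residuals has a closed-form solution. A self-contained sketch with illustrative data (not the real Prestige values):

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = b0 + b1*x.

    Minimizing sum((y - b0 - b1*x)**2) over b0 and b1 yields
    b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Data generated from an exact line y = 3 + 2x: OLS should recover it,
# and the minimized sum of squared residuals should be zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]
b0, b1 = ols_fit(xs, ys)
ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(b0, b1, ssr)  # intercept 3, slope 2, SSR 0
```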