WEBVTT
00:00:00.180 --> 00:00:06.270
This video will introduce you to
the regression analysis assumptions,
00:00:06.270 --> 00:00:12.030
or more specifically, the assumptions that
the least squares estimation principle relies on.
00:00:12.030 --> 00:00:15.150
So the idea of least squares estimation
00:00:15.150 --> 00:00:19.860
or regression model is that we
have one dependent variable y.
00:00:19.860 --> 00:00:23.250
And in this example, we have
one independent variable x.
00:00:23.250 --> 00:00:28.140
And we draw a line through the middle of
the data, the scatter plot of the data.
00:00:28.140 --> 00:00:30.930
And regression analysis assumes that
00:00:30.930 --> 00:00:34.770
these observations are equally
spread out around this line.
00:00:34.770 --> 00:00:37.500
So the dispersion of observations is the
00:00:37.500 --> 00:00:42.120
same here as the dispersion of
observations there along the line.
00:00:42.120 --> 00:00:46.590
So each individual observation in
our data falls somewhere around this line;
00:00:46.590 --> 00:00:50.730
some fall exactly on the line, some
fall a bit further from the line.
00:00:50.730 --> 00:00:54.630
We also assume that
when we know that x is one,
00:00:54.630 --> 00:01:00.420
then the values of y are normally
distributed around the regression line.
00:01:00.420 --> 00:01:04.050
So that's basically a summary of the assumptions.
00:01:04.050 --> 00:01:09.330
And now we will take a look at
specific parts of those assumptions.
00:01:09.330 --> 00:01:13.650
Before we do so, we have to talk a
bit about what the assumptions mean,
00:01:13.650 --> 00:01:14.970
because there are some misconceptions.
00:01:14.970 --> 00:01:19.920
For example, sometimes students in my classes say
00:01:19.920 --> 00:01:24.030
that an estimation technique requires
that the data are normally distributed,
00:01:24.030 --> 00:01:28.980
and they think it implies that the
estimation technique cannot be applied
00:01:28.980 --> 00:01:33.390
when the data are not normal.
That has two problems.
00:01:33.390 --> 00:01:37.350
First of all, we rarely make assumptions
about the distribution of observed data.
00:01:37.350 --> 00:01:44.850
And second, the fact that an assumption doesn't
00:01:44.850 --> 00:01:49.230
hold exactly doesn't mean that the
estimator is immediately useless.
00:01:49.230 --> 00:01:52.800
Let's start with examples
of models and estimators.
00:01:52.800 --> 00:01:54.720
That way we understand what the assumptions mean.
00:01:54.720 --> 00:01:56.520
So here's the regression model.
00:01:56.520 --> 00:02:01.200
It's that y is a weighted sum of the x's,
the observed independent variables,
00:02:01.200 --> 00:02:04.470
plus some error term u that
the model doesn't explain.
00:02:04.470 --> 00:02:07.890
Then we have estimators and estimation principles:
00:02:07.890 --> 00:02:11.040
how do we choose the betas,
which set of betas is the best?
00:02:11.040 --> 00:02:17.490
And one good rule is the OLS rule:
minimize the sum of squared residuals.
00:02:17.490 --> 00:02:20.010
So we choose the betas here.
00:02:20.010 --> 00:02:22.140
So that the sum of squared residuals,
00:02:22.140 --> 00:02:28.200
which is the difference between the observed
value y and the fitted value from the betas,
00:02:28.200 --> 00:02:33.810
is as small as possible. That's
what we are discussing in this part.
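The OLS rule just described, choosing the betas so that the sum of squared residuals is as small as possible, can be sketched in numpy. This is a minimal illustration with made-up data and coefficients, not an example from the video:

```python
import numpy as np

# A minimal OLS sketch: choose the betas that minimize the sum of
# squared residuals. Data and coefficients are made up.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
u = rng.normal(0, 1, n)          # error term the model doesn't explain
y = 2.0 + 0.5 * x + u            # true intercept 2.0, slope 0.5

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
# Closed-form least squares solution: beta = (X'X)^-1 X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta
print(beta)                       # estimates close to [2.0, 0.5]
print(np.sum(residuals ** 2))     # the minimized sum of squared residuals
```

In practice you would call a library routine such as `np.linalg.lstsq`, which handles ill-conditioned design matrices better than forming X'X explicitly.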
00:02:33.810 --> 00:02:36.570
But that's not the only way of
estimating a regression model.
00:02:36.570 --> 00:02:40.590
For example, we could use weighted least squares.
00:02:40.590 --> 00:02:44.580
So weighted least squares is the same as OLS,
00:02:44.580 --> 00:02:48.360
except that instead of minimizing
a sum of squared residuals,
00:02:48.360 --> 00:02:54.000
we minimize the weighted sum of squared
residuals or sum of weighted squared residuals.
00:02:54.000 --> 00:02:59.430
The idea of weighted least squares
is that some observations provide
00:02:59.430 --> 00:03:03.210
us more information about where the
regression line goes than others.
00:03:03.210 --> 00:03:08.580
And in some scenarios, weighted
least squares is better than OLS.
00:03:08.580 --> 00:03:10.830
To understand what those scenarios are,
00:03:10.830 --> 00:03:14.730
we have to understand the assumptions.
But that's not all; we also have others.
00:03:14.730 --> 00:03:19.020
So there's feasible generalized least squares,
00:03:19.020 --> 00:03:24.840
which is the same as weighted least squares,
except that it estimates the weights from the data.
00:03:24.840 --> 00:03:29.010
So that makes slightly fewer assumptions than weighted
least squares, and there are trade-offs in that.
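A simplified feasible GLS sketch: fit OLS first, model the residual variance as a function of x (the log-linear variance model below is a made-up assumption for illustration), then reweight:

```python
import numpy as np

# Feasible GLS sketch: like WLS, but the weights are estimated from
# the data instead of assumed known. The variance model is illustrative.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # error spread grows with x

X = np.column_stack([np.ones(n), x])
# Step 1: ordinary least squares to get residuals
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_ols
# Step 2: model log residual variance as linear in log x
gamma = np.polyfit(np.log(x), np.log(resid**2), 1)
var_hat = np.exp(np.polyval(gamma, np.log(x)))
# Step 3: weighted least squares with the estimated weights
w = 1.0 / var_hat
Xw = X * w[:, None]
beta_fgls = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(beta_fgls)   # close to the true [2.0, 0.5]
```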
00:03:29.010 --> 00:03:34.140
We also have iteratively
reweighted least squares, or IRLS.
00:03:34.140 --> 00:03:39.300
The idea of IRLS is that it
weights the residuals iteratively.
00:03:39.300 --> 00:03:45.000
And the weights for the next iteration
are based on the previous iteration.
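The iterative reweighting can be sketched with Huber-type weights. The weighting function, tuning constant, and fixed iteration count below are illustrative choices, not something the video specifies:

```python
import numpy as np

# IRLS sketch: observations with large residuals get downweighted,
# and the weights for each iteration come from the previous fit.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
y[:10] += 15                       # a few outlier observations

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)        # start from OLS
for _ in range(20):
    resid = y - X @ beta
    scale = np.median(np.abs(resid)) / 0.6745   # robust scale estimate
    c = 1.345 * scale                            # Huber cutoff
    # weight 1 inside the cutoff, shrinking outside it
    w = np.where(np.abs(resid) <= c, 1.0, c / np.abs(resid))
    Xw = X * w[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
print(beta)   # much less affected by the outliers than plain OLS
```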
00:03:45.000 --> 00:03:50.220
And this is a good technique when you have outlier
observations, which I talk about in another video.
00:03:50.220 --> 00:03:56.250
So all of these techniques can be used in
different scenarios; they all work reasonably
00:03:56.250 --> 00:04:01.800
well in some conditions, and in some conditions one
of these rules is clearly better than the others.
00:04:01.800 --> 00:04:04.440
To understand that we have to
understand the assumptions.
00:04:04.440 --> 00:04:09.360
We can also use different models.
00:04:09.360 --> 00:04:13.860
So the regression model is
not necessarily the best model.
00:04:13.860 --> 00:04:20.460
For example, instead of a regression model,
we could apply a generalized linear model, which
00:04:20.460 --> 00:04:25.050
takes the fitted values from the regression
and applies a function to them.
00:04:25.050 --> 00:04:29.220
And then it doesn't make the assumption
that observations are normally distributed.
00:04:29.220 --> 00:04:32.310
So that's one alternative model.
00:04:32.310 --> 00:04:37.410
So you can choose either an alternative model
or an alternative estimator when your data
00:04:37.410 --> 00:04:42.750
don't really fit the model-estimator
combination that you're planning to use.
00:04:42.750 --> 00:04:45.630
Here's another one: this is a multilevel model.
00:04:45.630 --> 00:04:50.568
And this would be applicable when you
have, for example, longitudinal data.
00:04:50.568 --> 00:04:53.010
So you have multiple
observations for each company,
00:04:53.010 --> 00:04:56.430
and many companies in the data
and you assume that there are
00:04:56.430 --> 00:05:00.480
some constant differences between
companies that persist over time.
00:05:00.480 --> 00:05:03.780
And then you would use that
kind of model because you are
00:05:03.780 --> 00:05:08.340
violating the random sampling
assumption of regression analysis.
00:05:08.340 --> 00:05:12.270
So there are different
things that you can use.
00:05:12.270 --> 00:05:18.330
I recommend as a default option to go
with regression analysis and OLS estimation;
00:05:18.330 --> 00:05:22.080
if you have a good reason to use something else,
then do that.
00:05:22.080 --> 00:05:27.990
But start with OLS and regression model,
because it will tell you something about the
00:05:27.990 --> 00:05:31.920
data that you didn't know before estimation.
And it's quick to calculate.
00:05:31.920 --> 00:05:34.230
Then you go to more complicated things,
00:05:34.230 --> 00:05:41.220
if specific assumptions of OLS don't
really fit your research scenario.
00:05:41.220 --> 00:05:48.720
Okay, so what do the assumptions mean?
Assumptions are required for certain proofs.
00:05:48.720 --> 00:05:55.710
So, when we say that OLS requires that
the error term is normally distributed,
00:05:55.710 --> 00:06:01.350
it means that it has been proven
that OLS is consistent, unbiased,
00:06:01.350 --> 00:06:03.390
efficient, and the estimates are normal,
00:06:03.390 --> 00:06:06.390
when, among other assumptions, the
error term is normally distributed.
00:06:06.390 --> 00:06:10.140
So certain proofs require these assumptions.
00:06:10.140 --> 00:06:15.540
If we can't assume certain
things, then the proof can't be done.
00:06:15.540 --> 00:06:18.150
So, if the error term is not normally distributed,
00:06:18.150 --> 00:06:26.670
then we cannot prove that the OLS
estimator is unbiased in small samples.
00:06:26.670 --> 00:06:28.800
It could be but we can't prove it.
00:06:28.800 --> 00:06:36.660
So these assumptions imply one important
thing and they don't imply another thing.
00:06:36.660 --> 00:06:44.640
So what they do imply is that the estimator is
useful when we are close to these ideal conditions.
00:06:44.640 --> 00:06:51.360
So regression analysis assumes that the
relationships in the data are linear,
00:06:51.360 --> 00:06:55.290
if they are close to linear,
but not exactly linear,
00:06:55.290 --> 00:06:57.660
regression analysis will be a useful tool.
00:06:57.660 --> 00:07:00.030
So these assumptions don't have to hold exactly.
00:07:00.030 --> 00:07:05.820
If they are close enough, then
we will get still good results.
00:07:05.820 --> 00:07:13.650
Also, it doesn't imply that if an estimator has
been proven to be consistent under some scenario,
00:07:13.650 --> 00:07:17.880
then it's immediately useless in other scenarios.
00:07:17.880 --> 00:07:20.820
So the fact that something has been proven in
00:07:20.820 --> 00:07:25.350
one condition doesn't mean that it
cannot work in another condition.
00:07:25.350 --> 00:07:29.700
But it's important to understand the
limitations of these different techniques.
00:07:29.700 --> 00:07:36.720
And for that, we typically test the
assumptions after we do our analysis.
00:07:36.720 --> 00:07:42.450
Now that we have understood that the assumptions
are something that should ideally hold,
00:07:42.450 --> 00:07:46.500
but in practice, they hold only approximately.
00:07:46.500 --> 00:07:51.660
And also we have understood that
if we are in violation of,
00:07:51.660 --> 00:07:55.440
for example, the normality
assumption in regression analysis,
00:07:55.440 --> 00:07:58.140
it doesn't necessarily have
any severe consequences,
00:07:58.140 --> 00:08:00.360
it just means that certain things cannot be proven,
00:08:00.360 --> 00:08:03.630
and the thing that we can't prove could still be true.
00:08:03.630 --> 00:08:06.300
Let's take a look at the actual assumptions.
00:08:06.300 --> 00:08:14.220
Regression analysis, or the OLS estimator,
00:08:14.220 --> 00:08:18.480
requires four assumptions to provide
consistent and unbiased estimates.
00:08:18.480 --> 00:08:22.050
And the unbiasedness property
here refers to any sample size.
00:08:22.050 --> 00:08:25.860
So regression analysis is unbiased
regardless of the sample size.
00:08:25.860 --> 00:08:29.070
You can get unbiased estimates
with a sample of 10 observations.
00:08:29.070 --> 00:08:32.490
The estimates will not be very precise,
but they're still unbiased.
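A small Monte Carlo sketch of this point, with made-up numbers: individual 10-observation estimates scatter widely, but their average across many samples sits at the true slope, which is what unbiasedness means:

```python
import numpy as np

# Monte Carlo sketch: OLS with only 10 observations per sample.
# Each estimate is imprecise, but their average is on target.
rng = np.random.default_rng(4)
true_slope = 0.5
slopes = []
for _ in range(5000):
    x = rng.uniform(0, 10, 10)
    y = 2.0 + true_slope * x + rng.normal(0, 1, 10)
    X = np.column_stack([np.ones(10), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    slopes.append(beta[1])
slopes = np.array(slopes)
print(slopes.std())    # large spread: each estimate is imprecise
print(slopes.mean())   # but the mean is close to 0.5: unbiased
```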
00:08:32.490 --> 00:08:35.790
The first assumption is
that we have a linear model.
00:08:35.790 --> 00:08:38.820
So that assumption basically
just defines the model.
00:08:38.820 --> 00:08:41.700
And that's all there is to it.
00:08:41.700 --> 00:08:44.550
Then the second assumption is random sampling.
00:08:44.550 --> 00:08:47.790
So random sampling means that
all observations are independent.
00:08:47.790 --> 00:08:55.230
And each observation in the population has an equal
probability of being selected into the sample.
00:08:55.230 --> 00:08:57.720
This is a feature of your research design.
00:08:57.720 --> 00:09:00.930
And it can't really be
tested empirically directly,
00:09:01.500 --> 00:09:06.090
you can test some aspects
of this random sampling assumption.
00:09:06.090 --> 00:09:07.950
And I will talk about that later.
00:09:07.950 --> 00:09:11.970
Then we have two other assumptions.
00:09:11.970 --> 00:09:15.315
Assumption three is that there's
no perfect collinearity.
00:09:15.315 --> 00:09:18.360
Perfect collinearity is
different from multicollinearity.
00:09:18.360 --> 00:09:25.140
Perfect collinearity means that
one or more of the
00:09:25.140 --> 00:09:30.600
independent variables in the model are completely
determined by the other independent variables.
00:09:30.600 --> 00:09:38.070
So for example, suppose we have three dummy variables
that define a categorical variable.
00:09:38.070 --> 00:09:42.210
If we know the values of two of the dummies,
then we can infer the third.
00:09:42.210 --> 00:09:48.210
That assumption requires that every
new variable that we
00:09:48.210 --> 00:09:51.750
enter into the model brings new
information about the phenomenon.
00:09:51.750 --> 00:09:56.040
Let's use gender as an example.
00:09:56.040 --> 00:09:59.580
We only need to know whether
a person is or is not a male.
00:09:59.580 --> 00:10:06.240
If the person is not a male, then we know
that she is a female. So having both
00:10:06.240 --> 00:10:11.010
a variable for male and a variable
for female would be perfectly collinear.
00:10:11.010 --> 00:10:15.480
Because knowing whether a person
is a man automatically tells you
00:10:15.480 --> 00:10:17.760
whether the same person is a woman or not.
00:10:17.760 --> 00:10:20.040
So that's perfect collinearity.
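This can be seen numerically: with an intercept plus both a male and a female dummy, the cross-product matrix is rank-deficient, so the normal equations have no unique solution. The data below are made up for illustration:

```python
import numpy as np

# Perfect collinearity sketch: female is completely determined by
# male (female = 1 - male), so with an intercept the design matrix
# X'X is singular and no unique betas exist.
male = np.array([1, 0, 1, 1, 0, 0, 1, 0])
female = 1 - male
X = np.column_stack([np.ones(8), male, female])
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # rank 2, not 3: no unique solution
# Dropping one dummy restores full rank, so the model can be estimated
X_ok = np.column_stack([np.ones(8), male])
print(np.linalg.matrix_rank(X_ok.T @ X_ok))   # full rank 2: fine
```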
00:10:20.040 --> 00:10:26.310
Then assumption four: the zero conditional mean. This is
a technical way of expressing it,
00:10:26.310 --> 00:10:29.580
but it basically tells you that we assume that
00:10:29.580 --> 00:10:34.830
the error term is uncorrelated
with all explanatory variables.
00:10:34.830 --> 00:10:40.950
And this is a somewhat more complicated assumption
that I'll explain in another video,
00:10:40.950 --> 00:10:45.210
but this is also referred to as
the no endogeneity assumption.
00:10:45.210 --> 00:10:49.860
And if we look at this diagram
of regression analysis,
00:10:49.860 --> 00:10:59.640
then this assumption number four can be understood
as saying that where this distribution is located
00:10:59.640 --> 00:11:03.330
doesn't depend on the regression line.
00:11:03.330 --> 00:11:09.120
So the distribution is always exactly at
the regression line, instead of for example,
00:11:09.120 --> 00:11:14.070
the line going here and the observations
being normally distributed somewhere over there.
00:11:14.070 --> 00:11:19.470
So that is called the no endogeneity
assumption, and endogeneity is a big issue.
00:11:19.470 --> 00:11:23.730
If we want to make causal
claims using observational data,
00:11:23.730 --> 00:11:26.100
I'll return to that in another video.
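A quick simulation sketch of why endogeneity matters, with an assumed omitted variable that drives both x and y, so the error term is correlated with x:

```python
import numpy as np

# Endogeneity sketch: an omitted variable z affects both x and y,
# so the error term is correlated with x. OLS is then biased no
# matter how large the sample. Numbers are made up.
rng = np.random.default_rng(7)
n = 100_000
z = rng.normal(0, 1, n)             # omitted cause
x = z + rng.normal(0, 1, n)         # x depends on z
y = 2.0 + 0.5 * x + z + rng.normal(0, 1, n)   # so does y

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta[1])   # near 1.0, not the true 0.5: omitted-variable bias
```

The slope converges to 0.5 + cov(x, z)/var(x) = 1.0 here, so more data does not fix the problem.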
00:11:26.100 --> 00:11:31.980
So under these four assumptions,
OLS is unbiased and consistent.
00:11:31.980 --> 00:11:40.680
We still have two more assumptions that OLS
makes, which are required for the consistency
00:11:40.680 --> 00:11:44.250
and unbiasedness of standard errors,
and the normality of the estimates.
00:11:44.250 --> 00:11:54.675
Standard errors are unbiased and consistent if
the error term is homoskedastic,
00:11:54.675 --> 00:11:55.770
so there is no heteroskedasticity.
00:11:55.770 --> 00:12:03.600
What this assumption means is that the observations
are equally spread out around the regression line.
00:12:03.600 --> 00:12:08.730
We would have a heteroskedasticity problem,
if the observations are close
00:12:08.730 --> 00:12:12.570
to the regression line here,
but far from the regression line here.
00:12:12.570 --> 00:12:17.250
So if instead of observing a band of
observations around the regression line,
00:12:17.250 --> 00:12:23.640
we would observe a funnel or
megaphone shape that opens up.
00:12:23.640 --> 00:12:27.780
So that's the homoskedasticity assumption.
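A funnel-shaped data set makes the point concrete. The split-in-half variance comparison below is a crude illustration of unequal spread, not a formal test such as Breusch-Pagan:

```python
import numpy as np

# Heteroskedasticity sketch: residual spread grows with x (a funnel
# shape), so residual variance differs between low and high x.
rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x)   # funnel shape

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
low, high = resid[x < 5], resid[x >= 5]
print(low.var(), high.var())   # clearly unequal: heteroskedasticity
```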
00:12:27.780 --> 00:12:32.850
These five assumptions together are known as the
00:12:32.850 --> 00:12:37.770
Gauss-Markov assumptions, and OLS is
efficient under these assumptions.
00:12:37.770 --> 00:12:41.370
But more importantly, the homoskedasticity
00:12:41.370 --> 00:12:45.180
assumption is required for the standard
errors to be unbiased and consistent.
00:12:45.180 --> 00:12:52.890
That is important because the t statistic for our
statistical inference, for the p value, requires
00:12:52.890 --> 00:12:59.010
that both the estimate and the standard error are
consistent and unbiased. Under those conditions,
00:12:59.010 --> 00:13:02.580
the t value will follow the t distribution
00:13:02.580 --> 00:13:07.290
when the null hypothesis of no effect
holds, and we get proper p values.
00:13:07.290 --> 00:13:09.300
So that's the fifth assumption.
00:13:10.050 --> 00:13:15.900
Then the final one, the one that
most people are probably
00:13:15.900 --> 00:13:19.980
most aware of, is the normality assumption.
00:13:19.980 --> 00:13:22.890
So, this is also misunderstood:
00:13:22.890 --> 00:13:27.300
regression analysis does not assume that any
observed variable is normally distributed.
00:13:27.840 --> 00:13:34.920
Instead, it assumes that the error term, the
unobservable part, or how much the observations
00:13:34.920 --> 00:13:38.190
vary around the regression line,
is normally distributed.
00:13:38.190 --> 00:13:49.290
This rule actually
implies rules four and five.
00:13:49.290 --> 00:13:55.500
And these assumptions, one through six, are
called the classical linear model assumptions.
00:13:55.500 --> 00:14:01.800
In practice, the normality of the
error term assumption can be ignored,
00:14:01.800 --> 00:14:09.990
because the OLS estimator is, as
we say, asymptotically normal.
00:14:09.990 --> 00:14:15.690
So, it means that when the sample
size increases towards infinity,
00:14:15.690 --> 00:14:19.380
then the regression estimates
will be normally distributed,
00:14:19.380 --> 00:14:23.880
regardless of how the error term
is distributed in the population.
00:14:23.880 --> 00:14:30.780
In practice, the sample sizes that
we use are 100, or a few hundred.
00:14:30.780 --> 00:14:35.370
That is enough for this asymptotic
normality to start to kick in.
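A sketch of that asymptotic normality, using deliberately skewed (exponential) errors and an assumed sample size of 100 per draw: the sampling distribution of the slope is already close to normal, with roughly 95% of estimates within two standard deviations of the mean:

```python
import numpy as np

# Asymptotic normality sketch: even with strongly skewed errors,
# the distribution of the OLS slope across many samples of n = 100
# is already approximately normal.
rng = np.random.default_rng(6)
slopes = []
for _ in range(5000):
    x = rng.uniform(0, 10, 100)
    u = rng.exponential(1.0, 100) - 1.0    # skewed errors, mean zero
    y = 2.0 + 0.5 * x + u
    X = np.column_stack([np.ones(100), x])
    slopes.append(np.linalg.solve(X.T @ X, X.T @ y)[1])
slopes = np.array(slopes)
z = (slopes - slopes.mean()) / slopes.std()
print(np.mean(np.abs(z) < 1.96))   # close to 0.95, as for a normal
```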
00:14:35.370 --> 00:14:36.300
In practice,
00:14:36.900 --> 00:14:42.870
I have tried to demonstrate scenarios where
the lack of normality of the error term would
00:14:42.870 --> 00:14:46.620
be problematic with 50 or more
observations, and I have failed.
00:14:46.620 --> 00:14:51.630
So I cannot think of a scenario
where this normality assumption
00:14:51.630 --> 00:14:54.900
is a practical concern for an applied researcher.
00:14:54.900 --> 00:14:57.510
Let's summarize the assumptions.
00:14:57.510 --> 00:14:59.490
So we have six assumptions.
00:14:59.490 --> 00:15:02.130
First, all relationships are linear.
00:15:02.130 --> 00:15:07.290
That can be checked after the model has been
estimated; how we check that, I'll cover later.
00:15:07.290 --> 00:15:11.040
Then independence of observations:
they must be a random sample.
00:15:11.040 --> 00:15:13.470
This is a feature of your research design.
00:15:13.470 --> 00:15:21.120
And you can check the independence of observations
after estimation under certain scenarios.
00:15:21.120 --> 00:15:25.530
Then no perfect collinearity and nonzero
variance of independent variables.
00:15:25.530 --> 00:15:30.540
If that fails, then a regression
model cannot be estimated.
00:15:30.540 --> 00:15:37.860
For example, if you're studying the effect
of gender on performance in a statistics course,
00:15:38.700 --> 00:15:40.350
and you only observe women,
00:15:40.350 --> 00:15:45.090
so you have no variation in the gender variable,
then you cannot estimate the gender effect.
00:15:45.090 --> 00:15:52.560
Also, if you have two variables that quantify
the exact same thing, then you can't enter
00:15:52.560 --> 00:15:56.130
both into the regression model.
This does not need to be checked,
00:15:56.130 --> 00:16:00.780
because when you run a regression analysis,
00:16:00.780 --> 00:16:03.660
you will know if this fails:
the regression simply doesn't complete.
00:16:03.660 --> 00:16:08.610
Then: the error term has an expected value of zero
given any values of the independent variables.
00:16:08.610 --> 00:16:12.360
In practice, this means that all other causes
00:16:12.360 --> 00:16:15.630
of the dependent variable that
are not included in the model,
00:16:15.630 --> 00:16:19.590
must be uncorrelated with all causes
that are included in the model.
00:16:19.590 --> 00:16:21.360
That's a strong assumption,
00:16:21.360 --> 00:16:25.440
it cannot be tested directly
after least squares estimation,
00:16:25.440 --> 00:16:30.510
but we can test this assumption with instrumental
variables, which are covered in a later video.
00:16:30.510 --> 00:16:35.460
Then we have: error term has equal variance
given any values of independent variables.
00:16:35.460 --> 00:16:41.130
This is the no heteroskedasticity assumption;
this should be checked after estimation,
00:16:41.130 --> 00:16:44.490
because it influences the standard
errors of regression analysis.
00:16:44.490 --> 00:16:48.210
And if you have a heteroskedasticity
problem, it is easy to fix.
00:16:48.210 --> 00:16:53.670
Then: the error term is normally distributed.
I typically check this because it's useful
00:16:53.670 --> 00:16:58.050
to know if some of the values are far from
the regression line to identify outliers,
00:16:58.050 --> 00:17:01.350
but other than that,
this is not an important assumption.