WEBVTT

00:00:00.180 --> 00:00:06.270
This video will introduce you to regression analysis assumptions,

00:00:06.270 --> 00:00:12.030
or specifically, the assumptions that the least squares estimation principle makes.

00:00:12.030 --> 00:00:15.150
So the idea of least squares estimation

00:00:15.150 --> 00:00:19.860
or the regression model is that we have one dependent variable y.

00:00:19.860 --> 00:00:23.250
And in this example, we have one independent variable x.

00:00:23.250 --> 00:00:28.140
And we draw a line through the middle of the scatter plot of the data.

00:00:28.140 --> 00:00:30.930
And regression analysis assumes that

00:00:30.930 --> 00:00:34.770
these observations are equally spread out around this line.

00:00:34.770 --> 00:00:37.500
So the dispersion of observations is the

00:00:37.500 --> 00:00:42.120
same here as it is there, all along the line.

00:00:42.120 --> 00:00:46.590
So each individual observation in our data falls somewhere around this line;

00:00:46.590 --> 00:00:50.730
some fall exactly on the line, some a bit further from it.

00:00:50.730 --> 00:00:54.630
We also assume that when we know that x is one,

00:00:54.630 --> 00:01:00.420
then the values of y are normally distributed around the regression line.

00:01:00.420 --> 00:01:04.050
So that's basically a summary of the assumptions.

00:01:04.050 --> 00:01:09.330
And now we will take a look at specific parts of those assumptions.

00:01:09.330 --> 00:01:13.650
Before we do so, we have to talk a bit about what the assumptions mean,

00:01:13.650 --> 00:01:14.970
because there are some misconceptions.

00:01:14.970 --> 00:01:19.920
For example, sometimes students in my classes say

00:01:19.920 --> 00:01:24.030
that an estimation technique requires that the data are normally distributed,

00:01:24.030 --> 00:01:28.980
and they think it implies that the estimation technique cannot be applied

00:01:28.980 --> 00:01:33.390
when the data are not normal. That has two problems.

00:01:33.390 --> 00:01:37.350
First of all, we rarely make assumptions about the distribution of observed data.

00:01:37.350 --> 00:01:44.850
And second, the fact that an assumption doesn't

00:01:44.850 --> 00:01:49.230
hold exactly doesn't mean that the estimator is immediately useless.

00:01:49.230 --> 00:01:52.800
Let's start with examples of models and estimators,

00:01:52.800 --> 00:01:54.720
so that we understand what assumptions mean.

00:01:54.720 --> 00:01:56.520
So here's the regression model.

00:01:56.520 --> 00:02:01.200
It says that y is a weighted sum of the x's, the observed independent variables,

00:02:01.200 --> 00:02:04.470
plus some error term u that the model doesn't explain.

00:02:04.470 --> 00:02:07.890
Then we have estimators, the estimation principles:

00:02:07.890 --> 00:02:11.040
how do we choose the betas, and which set of betas is the best?

00:02:11.040 --> 00:02:17.490
And one good rule is the OLS rule: minimize the sum of squared residuals.

00:02:17.490 --> 00:02:20.010
So we choose the betas

00:02:20.010 --> 00:02:22.140
so that the sum of squared residuals,

00:02:22.140 --> 00:02:28.200
that is, the differences between the observed values y and the fitted values from the betas,

00:02:28.200 --> 00:02:33.810
is as small as possible. That's what this part expresses.

00:02:33.810 --> 00:02:36.570
But that's not the only way of estimating a regression model.

00:02:36.570 --> 00:02:40.590
For example, we could use weighted least squares.

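NOTE
Editor's sketch (not part of the narration): the OLS rule just described,
choosing betas to minimize the sum of squared residuals, on a simulated
one-regressor dataset. All names and numbers here are illustrative assumptions.
import numpy as np
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
u = rng.normal(size=n)                 # error term the model doesn't explain
y = 1.0 + 2.0 * x + u                  # true model: y = beta0 + beta1*x + u
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min_b sum((y - X@b)**2)
residuals = y - X @ beta_hat
print("betas:", beta_hat, "SSR:", np.sum(residuals**2))
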
00:02:40.590 --> 00:02:44.580
So weighted least squares is the same as OLS,

00:02:44.580 --> 00:02:48.360
except that instead of minimizing the sum of squared residuals,

00:02:48.360 --> 00:02:54.000
we minimize the weighted sum of squared residuals, or sum of weighted squared residuals.

00:02:54.000 --> 00:02:59.430
The idea of weighted least squares is that some observations provide

00:02:59.430 --> 00:03:03.210
us more information about where the regression line goes than others.

00:03:03.210 --> 00:03:08.580
And in some scenarios, weighted least squares is better than OLS.

00:03:08.580 --> 00:03:10.830
To understand what those scenarios are,

00:03:10.830 --> 00:03:14.730
we have to understand the assumptions. But that's not all; we also have other estimators.

00:03:14.730 --> 00:03:19.020
So there's feasible generalized least squares,

00:03:19.020 --> 00:03:24.840
which is the same as weighted least squares, except that it estimates the weights from the data.

00:03:24.840 --> 00:03:29.010
So it makes somewhat fewer assumptions than weighted least squares, and there are trade-offs in that.

00:03:29.010 --> 00:03:34.140
We also have iteratively reweighted least squares, or IRLS.

00:03:34.140 --> 00:03:39.300
The idea of IRLS is that it weights the residuals iteratively,

00:03:39.300 --> 00:03:45.000
and the weights for the next iteration are based on the previous iteration.

00:03:45.000 --> 00:03:50.220
And this is a good technique when you have outlier observations, which I talk about in another video.

00:03:50.220 --> 00:03:56.250
So all of these techniques can be used in different scenarios. They all work reasonably

00:03:56.250 --> 00:04:01.800
well in some conditions, and in some conditions one of these rules is clearly better than the others.

00:04:01.800 --> 00:04:04.440
To understand that, we have to understand the assumptions.

00:04:04.440 --> 00:04:09.360
The same applies to models: we can use different models.

00:04:09.360 --> 00:04:13.860
So the regression model is not necessarily the best model.

00:04:13.860 --> 00:04:20.460
For example, instead of the regression model, we could apply a generalized linear model, which

00:04:20.460 --> 00:04:25.050
takes the fitted values from regression analysis and applies a function to them.

00:04:25.050 --> 00:04:29.220
And then it doesn't make the assumption that observations are normally distributed.

00:04:29.220 --> 00:04:32.310
So that's one alternative model.

00:04:32.310 --> 00:04:37.410
So you can choose either an alternative model or an alternative estimator when your data

00:04:37.410 --> 00:04:42.750
don't really fit the model and estimator combination that you were planning to use.

00:04:42.750 --> 00:04:45.630
Here's another one: this is a multilevel model.

00:04:45.630 --> 00:04:50.568
And this would be applicable when you have, for example, longitudinal data.

00:04:50.568 --> 00:04:53.010
So you have multiple observations for each company,

00:04:53.010 --> 00:04:56.430
and many companies in the data, and you assume that there are

00:04:56.430 --> 00:05:00.480
some constant differences between companies that persist over time.

00:05:00.480 --> 00:05:03.780
And then you would use that kind of model, because you are

00:05:03.780 --> 00:05:08.340
in violation of the random sampling assumption in regression analysis.

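NOTE
Editor's sketch (illustrative, not from the video): the weighted least squares
idea described above. Here the error spread is simulated to grow with x, and
the true weights (inverse error variances) are assumed known, which real WLS
applications usually cannot assume.
import numpy as np
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
sigma = 0.5 * x                            # assumed: noise grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)
X = np.column_stack([np.ones(n), x])
def wls(X, y, w):
    # Rescaling rows by sqrt(w) turns min_b sum(w_i * r_i**2) into plain OLS.
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print("OLS:", wls(X, y, np.ones(n)))       # equal weights: ordinary least squares
print("WLS:", wls(X, y, 1.0 / sigma**2))   # low-noise observations weighted more
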
00:05:08.340 --> 00:05:12.270
So there are different things that you can use.

00:05:12.270 --> 00:05:18.330
I recommend always going with regression analysis and OLS estimation as the default option;

00:05:18.330 --> 00:05:22.080
if you have a good reason to use something else, then do that.

00:05:22.080 --> 00:05:27.990
But start with OLS and the regression model, because it will tell you something about the

00:05:27.990 --> 00:05:31.920
data that you didn't know before estimation. And it's quick to calculate.

00:05:31.920 --> 00:05:34.230
Then you go to more complicated things,

00:05:34.230 --> 00:05:41.220
if specific assumptions of OLS don't really fit into your research scenario.

00:05:41.220 --> 00:05:48.720
Okay. So what are the assumptions? Assumptions are required for certain proofs.

00:05:48.720 --> 00:05:55.710
So when we say that OLS requires that the error term is normally distributed,

00:05:55.710 --> 00:06:01.350
it means that it has been proven that OLS is consistent, unbiased,

00:06:01.350 --> 00:06:03.390
efficient, and the estimates are normal,

00:06:03.390 --> 00:06:06.390
when, among other assumptions, the error term is normally distributed.

00:06:06.390 --> 00:06:10.140
So certain proofs require these assumptions.

00:06:10.140 --> 00:06:15.540
If we can't assume certain things, then the proof can't be done.

00:06:15.540 --> 00:06:18.150
So, if the error term is not normally distributed,

00:06:18.150 --> 00:06:26.670
then we cannot prove that the OLS estimates are normally distributed in small samples.

00:06:26.670 --> 00:06:28.800
They could be, but we can't prove it.

00:06:28.800 --> 00:06:36.660
So these assumptions imply one important thing, and they don't imply another thing.

00:06:36.660 --> 00:06:44.640
What they do imply is that the estimator is useful when we are close to these ideal conditions.

00:06:44.640 --> 00:06:51.360
So regression analysis assumes that the relationships in the data are linear;

00:06:51.360 --> 00:06:55.290
if they are close to linear, but not exactly linear,

00:06:55.290 --> 00:06:57.660
regression analysis will be a useful tool.

00:06:57.660 --> 00:07:00.030
So these assumptions don't have to hold exactly.

00:07:00.030 --> 00:07:05.820
If they are close enough, then we will still get good results.

00:07:05.820 --> 00:07:13.650
Also, they don't imply that if an estimator has been proven to be consistent under some scenario,

00:07:13.650 --> 00:07:17.880
then it's immediately useless in other scenarios.

00:07:17.880 --> 00:07:20.820
So the fact that something has been proven in

00:07:20.820 --> 00:07:25.350
one condition doesn't mean that it cannot work in another condition.

00:07:25.350 --> 00:07:29.700
But it's important to understand the limitations of these different techniques.

00:07:29.700 --> 00:07:36.720
And for that, we typically test the assumptions after we do our analysis.

00:07:36.720 --> 00:07:42.450
Now we have understood that the assumptions are something that should ideally hold,

00:07:42.450 --> 00:07:46.500
but in practice, they hold only approximately.

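NOTE
Editor's sketch (simulation designed by the editor, not from the video) of the
point that assumptions only need to hold approximately: the true relationship
below is mildly nonlinear, yet the OLS slope is still a useful summary.
import numpy as np
rng = np.random.default_rng(2)
n = 10_000
x = rng.uniform(0.0, 1.0, size=n)
y = 2.0 * x + 0.2 * x**2 + rng.normal(scale=0.5, size=n)  # slight curvature
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print("slope:", beta[1])   # close to 2.2, the average derivative of the truth
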
00:07:46.500 --> 00:07:51.660
And we have also understood that if we are in violation of,

00:07:51.660 --> 00:07:55.440
for example, the normality assumption in regression analysis,

00:07:55.440 --> 00:07:58.140
it doesn't necessarily have any severe consequences.

00:07:58.140 --> 00:08:00.360
It just means that certain things cannot be proven;

00:08:00.360 --> 00:08:03.630
the thing that we can't prove could still be true.

00:08:03.630 --> 00:08:06.300
Let's take a look at the actual assumptions.

00:08:06.300 --> 00:08:14.220
Regression analysis, or rather the OLS estimator,

00:08:14.220 --> 00:08:18.480
requires four assumptions to provide you consistent and unbiased estimates.

00:08:18.480 --> 00:08:22.050
And the unbiasedness property here refers to any sample size.

00:08:22.050 --> 00:08:25.860
So regression analysis is unbiased regardless of the sample size.

00:08:25.860 --> 00:08:29.070
You can get unbiased estimates with a sample of 10 observations.

00:08:29.070 --> 00:08:32.490
The estimates will be very imprecise, but they're still unbiased.

00:08:32.490 --> 00:08:35.790
The first assumption is that we have a linear model.

00:08:35.790 --> 00:08:38.820
So that assumption basically just defines the model.

00:08:38.820 --> 00:08:41.700
And that's all there is to it.

00:08:41.700 --> 00:08:44.550
Then the second assumption is random sampling.

00:08:44.550 --> 00:08:47.790
So random sampling means that all observations are independent,

00:08:47.790 --> 00:08:55.230
and each observation in the population has an equal probability of getting selected into the sample.

00:08:55.230 --> 00:08:57.720
This is a feature of your research design.

00:08:57.720 --> 00:09:00.930
And it can't really be tested empirically, directly;

00:09:01.500 --> 00:09:06.090
you can test some aspects of this random sampling.

00:09:06.090 --> 00:09:07.950
And I will talk about that later.

00:09:07.950 --> 00:09:11.970
Then we have two other assumptions.

00:09:11.970 --> 00:09:15.315
Assumption three is that there is no perfect collinearity.

00:09:15.315 --> 00:09:18.360
Perfect collinearity is different from multicollinearity.

00:09:18.360 --> 00:09:25.140
Perfect collinearity means that one or more of the

00:09:25.140 --> 00:09:30.600
independent variables in the model are completely determined by the other independent variables.

00:09:30.600 --> 00:09:38.070
So for example, suppose we have three dummy variables that define a categorical variable.

00:09:38.070 --> 00:09:42.210
If we know the values of two of the dummies, then we can infer the third.

00:09:42.210 --> 00:09:48.210
That assumption requires that every new variable that we

00:09:48.210 --> 00:09:51.750
enter into the model brings new information about the phenomenon.

00:09:51.750 --> 00:09:56.040
Let's use gender as an example.

00:09:56.040 --> 00:09:59.580
We only need to know whether a person is or is not a male.

00:09:59.580 --> 00:10:06.240
If the person is not a male, then we know that she is a female. So having

00:10:06.240 --> 00:10:11.010
both a variable for male and a variable for female would be perfectly collinear,

00:10:11.010 --> 00:10:15.480
because knowing whether a person is a man automatically tells you

00:10:15.480 --> 00:10:17.760
whether the same person is a woman or not.

00:10:17.760 --> 00:10:20.040
So that's perfect collinearity.

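NOTE
Editor's sketch following the video's gender example (data simulated by the
editor): entering both a "male" and a "female" dummy alongside an intercept
is perfect collinearity, so the design matrix loses full column rank.
import numpy as np
rng = np.random.default_rng(3)
n = 100
male = rng.integers(0, 2, size=n).astype(float)
female = 1.0 - male                       # completely determined by `male`
X = np.column_stack([np.ones(n), male, female])
print(np.linalg.matrix_rank(X))           # prints 2, not 3: a column is redundant
# np.linalg.inv(X.T @ X) would raise LinAlgError here: the normal equations
# have no unique solution, so the regression cannot be estimated as-is.
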
00:10:20.040 --> 00:10:26.310
Then there is the zero conditional mean assumption. That is a technical way of expressing it,

00:10:26.310 --> 00:10:29.580
but it basically tells you that we assume that

00:10:29.580 --> 00:10:34.830
the error term is uncorrelated with all explanatory variables.

00:10:34.830 --> 00:10:40.950
And this is a bit more complicated assumption that I'll explain in another video,

00:10:40.950 --> 00:10:45.210
but this is also referred to as the no endogeneity assumption.

00:10:45.210 --> 00:10:49.860
And if we look at this diagram of regression analysis,

00:10:49.860 --> 00:10:59.640
then this assumption number four can be understood as saying that where this distribution is located

00:10:59.640 --> 00:11:03.330
doesn't drift away from the regression line.

00:11:03.330 --> 00:11:09.120
So the distribution is always centered exactly at the regression line, instead of, for example,

00:11:09.120 --> 00:11:14.070
the line going here and the observations being normally distributed somewhere over here.

00:11:14.070 --> 00:11:19.470
So that is called the no endogeneity assumption, and endogeneity is a big issue

00:11:19.470 --> 00:11:23.730
if we want to make causal claims using observational data.

00:11:23.730 --> 00:11:26.100
I'll return to that in another video.

00:11:26.100 --> 00:11:31.980
So under these four assumptions, OLS is unbiased and consistent.

00:11:31.980 --> 00:11:40.680
We still have two more assumptions that OLS makes, which are required for the consistency

00:11:40.680 --> 00:11:44.250
and unbiasedness of the standard errors, and for the normality of the estimates.

00:11:44.250 --> 00:11:54.675
Standard errors are unbiased and consistent if the data, or the error term, is homoskedastic,

00:11:54.675 --> 00:11:55.770
so there is no heteroskedasticity.

00:11:55.770 --> 00:12:03.600
What this assumption means is that the observations are equally spread out around the regression line.

00:12:03.600 --> 00:12:08.730
We would have a heteroskedasticity problem if the observations are close

00:12:08.730 --> 00:12:12.570
to the regression line here, but far from the regression line here.

00:12:12.570 --> 00:12:17.250
So instead of observing a band of observations around the regression line,

00:12:17.250 --> 00:12:23.640
we would observe a funnel or megaphone shape that opens up.

00:12:23.640 --> 00:12:27.780
So that's the homoskedasticity assumption.

00:12:27.780 --> 00:12:32.850
These five assumptions together are known as the

00:12:32.850 --> 00:12:37.770
Gauss-Markov assumptions, and OLS is efficient under these assumptions.

00:12:37.770 --> 00:12:41.370
But more importantly, the homoskedasticity

00:12:41.370 --> 00:12:45.180
assumption is required for the standard errors to be unbiased and consistent.

00:12:45.180 --> 00:12:52.890
That is important because the t statistic for our statistical inference, for the p value, requires

00:12:52.890 --> 00:12:59.010
that both the estimate and the standard error are consistent and unbiased. Under those conditions,

00:12:59.010 --> 00:13:02.580
the t value will follow the t distribution when

00:13:02.580 --> 00:13:07.290
the null hypothesis of no effect holds, and we get proper p values.

00:13:07.290 --> 00:13:09.300
So that's the fifth assumption.

00:13:10.050 --> 00:13:15.900
Then the final one, the one that most people are probably

00:13:15.900 --> 00:13:19.980
most aware of, is the normality assumption.

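NOTE
Editor's sketch (simulated data, editor's numbers): the "funnel" shape of
heteroskedasticity described above. The residual spread near the start of
the line is much smaller than at the end, violating assumption five.
import numpy as np
rng = np.random.default_rng(4)
n = 1_000
x = np.sort(rng.uniform(1.0, 5.0, size=n))
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)  # noise grows with x: a funnel
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta
print("spread at small x:", r[: n // 2].std())   # tight around the line
print("spread at large x:", r[n // 2 :].std())   # much wider: heteroskedastic
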
00:13:19.980 --> 00:13:22.890
So, this is also often misunderstood:

00:13:22.890 --> 00:13:27.300
regression analysis does not assume that any observed variable is normally distributed.

00:13:27.840 --> 00:13:34.920
Instead, it assumes that the error term, the unobservable, or how much the observations

00:13:34.920 --> 00:13:38.190
vary around the regression line, is normally distributed.

00:13:38.190 --> 00:13:49.290
This assumption actually implies assumptions four and five.

00:13:49.290 --> 00:13:55.500
And these assumptions, one through six, are called the classical linear model assumptions.

00:13:55.500 --> 00:14:01.800
In practice, the normality of the error term assumption can be ignored,

00:14:01.800 --> 00:14:09.990
because the OLS estimator is, as we say, asymptotically normal.

00:14:09.990 --> 00:14:15.690
So it means that when the sample size increases towards infinity,

00:14:15.690 --> 00:14:19.380
then the regression estimates will be normally distributed,

00:14:19.380 --> 00:14:23.880
regardless of how the error term is distributed in the population.

00:14:23.880 --> 00:14:30.780
In practice, the sample sizes that we use are 100, or a few hundred.

00:14:30.780 --> 00:14:35.370
That is enough for this asymptotic normality to start to kick in.

00:14:35.370 --> 00:14:36.300
In practice,

00:14:36.900 --> 00:14:42.870
I have tried to demonstrate scenarios where the lack of normality of the error term would

00:14:42.870 --> 00:14:46.620
be problematic with 50 or more observations, and I have failed.

00:14:46.620 --> 00:14:51.630
So I cannot think of a scenario where this normality assumption

00:14:51.630 --> 00:14:54.900
is a practical concern for an applied researcher.

00:14:54.900 --> 00:14:57.510
Let's summarize the assumptions.

00:14:57.510 --> 00:14:59.490
So we have six assumptions.

00:14:59.490 --> 00:15:02.130
First, all relationships are linear.

00:15:02.130 --> 00:15:07.290
That can be checked after the model has been estimated; how we check that, I'll cover later.

00:15:07.290 --> 00:15:11.040
Then, independence of observations: they must be a random sample.

00:15:11.040 --> 00:15:13.470
This is a feature of your research design.

00:15:13.470 --> 00:15:21.120
And you can check the independence of observations after estimation under certain scenarios.

00:15:21.120 --> 00:15:25.530
Then, no perfect collinearity and nonzero variance of the independent variables.

00:15:25.530 --> 00:15:30.540
If that fails, then the regression model cannot be estimated.

00:15:30.540 --> 00:15:37.860
For example, if you're studying the effect of gender on performance in a statistics course,

00:15:38.700 --> 00:15:40.350
and you only observe women,

00:15:40.350 --> 00:15:45.090
so you have no variation in the gender variable, then you cannot estimate the gender effect.

00:15:45.090 --> 00:15:52.560
Also, if you have two variables that quantify the exact same thing, then you can't enter

00:15:52.560 --> 00:15:56.130
both into the regression model. This does not need to be checked in advance,

00:15:56.130 --> 00:16:00.780
because if you run a regression analysis,

00:16:00.780 --> 00:16:03.660
you will know if this fails, because the regression doesn't complete.

00:16:03.660 --> 00:16:08.610
Then, the error term has an expected value of zero given any values of the independent variables.

00:16:08.610 --> 00:16:12.360
In practice, this means that all other causes

00:16:12.360 --> 00:16:15.630
of the dependent variable that are not included in the model

00:16:15.630 --> 00:16:19.590
must be uncorrelated with all causes that are included in the model.

00:16:19.590 --> 00:16:21.360
That's a strong assumption;

00:16:21.360 --> 00:16:25.440
it cannot be tested directly after least squares estimation,

00:16:25.440 --> 00:16:30.510
but we can test this assumption with instrumental variables, which are covered in a later video.

00:16:30.510 --> 00:16:35.460
Then we have: the error term has equal variance given any values of the independent variables.

00:16:35.460 --> 00:16:41.130
This is the no heteroskedasticity assumption. This should be checked after estimation,

00:16:41.130 --> 00:16:44.490
because it influences the standard errors of regression analysis.

00:16:44.490 --> 00:16:48.210
And if you have a heteroskedasticity problem, it is easy to fix.

00:16:48.210 --> 00:16:53.670
Then, the error term is normally distributed. I typically check this because it's useful

00:16:53.670 --> 00:16:58.050
to know if some of the values are far from the regression line, to identify outliers,

00:16:58.050 --> 00:17:01.350
but other than that, this is not an important assumption.

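NOTE
Editor's sketch (simulation designed by the editor) of the asymptotic
normality claim made earlier: with clearly non-normal, skewed errors and
n = 100, the sampling distribution of the OLS slope is already close to
normal, which is why the normality assumption rarely matters in practice.
import numpy as np
rng = np.random.default_rng(5)
n, reps = 100, 5_000
slopes = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=n)
    u = rng.exponential(1.0, size=n) - 1.0   # skewed errors with mean zero
    y = 1.0 + 2.0 * x + u
    X = np.column_stack([np.ones(n), x])
    slopes[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]
print("mean slope:", slopes.mean())          # about 2: unbiased
skew = ((slopes - slopes.mean())**3).mean() / slopes.std()**3
print("skewness:", skew)                     # near 0: approximately normal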