Regression analysis assumes that the sample that you're analyzing is a random sample from the population. That could be violated, for example, if you have 100 observations but those observations are measured from only five different people, each of which is measured 20 times.

What is the impact of non-independence of observations on regression analysis, and what kind of problems could that cause for empirical analysis? Let's take a look.

Here are the six regression assumptions according to Wooldridge, and the second assumption is the independence of observations. So what will happen if the observations are not independent? I will go through this with a couple of examples.

Let's take a simple example where we are interested in estimating the mean of the population. Our sample is 100 observations, and these 100 observations come from five clusters. Let's say we are observing five companies over 20 years, or we are measuring reaction times from five people, each measured 20 times, and we want to know the population mean.

If the intraclass correlation is zero, so there is no dependence between the observations within a cluster, we get a very precise estimate of 0.08 for the mean. The population mean here is zero and the population variance is 1. The intraclass correlation is 0.
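This sampling design, 100 observations in five clusters with population mean 0 and total variance 1, can be sketched in code. This is an illustrative simulation, not the video's own code; decomposing each value into a shared cluster effect (variance = ICC) plus independent noise (variance = 1 − ICC) is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def clustered_sample(n_clusters=5, per_cluster=20, icc=0.0):
    """Draw n_clusters * per_cluster observations with intraclass
    correlation `icc`, keeping the population mean at 0 and the total
    variance at 1: each value is a shared cluster effect (variance = icc)
    plus an independent part (variance = 1 - icc)."""
    cluster_effect = rng.normal(0.0, np.sqrt(icc), size=n_clusters)
    noise = rng.normal(0.0, np.sqrt(1.0 - icc), size=(n_clusters, per_cluster))
    return (cluster_effect[:, None] + noise).ravel()

sample = clustered_sample(icc=0.0)   # ICC = 0: fully independent observations
sample_mean = sample.mean()
print(sample_mean)                   # close to the true population mean of 0
```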
What will happen if we increase the intraclass correlation? We make the yellow, green, and purple observations closer to one another, so let's start clustering the data. We can see that the yellow observations start to cluster here, the purple observations start to go here, and the green observations go somewhere in the middle.

When we increase the intraclass correlation of these data, while maintaining the variance of the data, we can see that the sample mean becomes a less and less accurate estimator of the population mean.

Originally, when we had 100 independent observations, our estimate was 0.08; after we have strongly clustered the data, it's 0.61. When the intraclass correlation is 1, we have a special case where there is no within-cluster variance. So we have 100 observations but only 5 unique values. If we only have 5 unique values, then it doesn't make a difference whether we have each of those 5 values 1000 times or just once, because we gain no new information about where the population mean is: after we have the first observation from a cluster, the other observations bring no new information into the analysis.

So the idea here is that when our data are independent, each observation brings the same amount of new information to the analysis.
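This loss of precision can be checked by repeated simulation. The sketch below is my illustration, not the video's code; it assumes data generated as a shared cluster effect (variance = ICC) plus independent noise, with total variance held at 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def clustered_sample(icc, n_clusters=5, per_cluster=20):
    """5 clusters x 20 observations; population mean 0, total variance 1,
    with the cluster effect carrying `icc` of the variance."""
    u = rng.normal(0.0, np.sqrt(icc), size=n_clusters)
    e = rng.normal(0.0, np.sqrt(1.0 - icc), size=(n_clusters, per_cluster))
    return (u[:, None] + e).ravel()

# Sampling sd of the sample mean over many replications, for rising ICC.
sds = []
for icc in (0.0, 0.5, 1.0):
    means = [clustered_sample(icc).mean() for _ in range(2000)]
    sds.append(np.std(means))
    print(f"ICC={icc:.1f}: sd of the sample mean = {np.std(means):.3f}")
# The sd grows from about 0.10 (100 independent observations) toward
# about 0.45 (effectively 5 observations), even though n is 100 throughout.
```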
When the observations are dependent, so there is intraclass correlation, then the first observation from a cluster brings lots of new, unique information to the analysis, but once we have that first observation, the other observations from the same cluster give us less and less information about where the population mean is.

For example, if we want to measure the average height of people in the university, and we have a measurement tape that contains some measurement error, then it's better to measure 100 people than to measure the same 10 people 10 times. And of course, if you have no measurement error, then measuring the same 10 people, or the same 5 people, over and over will not improve the precision of your estimate.

So the problem here is that when intraclass correlation increases, when there is non-independence, our estimates will be less precise. They are still consistent, they are still unbiased, but they are less precise.

Okay, so that's one variable. What if we have two variables and we want to run a regression analysis? We have X and we have Y. We still have 100 observations nested in five clusters, so we have 20 observations for each cluster.

Initially the intraclass correlation is 0, so all these observations are independent. There is no particular pattern in the colors, and the regression estimates are quite precise. The actual intercept is zero; our estimate is 0.1. The actual slope
The actual slope   00:05:06.810 --> 00:05:13.140 is 1. Our estimate is 1.07. So it's  pretty close. That's something that   00:05:13.140 --> 00:05:18.330 you can expect from 100 observations with one  explanatory variable in a regression analysis. 00:05:18.330 --> 00:05:24.510 When we increase the interclass correlation of  both these variables we can see again that there's   00:05:24.510 --> 00:05:30.630 some clustering. So these yellow observations  go here and these purple observations go here.   00:05:30.630 --> 00:05:37.200 Green observations go here and ultimately we are  in interclass correlation which is one. We are in   00:05:37.200 --> 00:05:44.220 a scenario where we have just 5 observations that  are repeated and again if we have the same data   00:05:44.220 --> 00:05:49.740 set we just repeat the observations that gives  us no new information for the estimation problem. 00:05:49.740 --> 00:05:57.360 The outcome is that when both of these  variables have clustering effects then our   00:05:57.360 --> 00:06:03.180 regression coefficients both the coefficient  and the slope will be less and less precise.   00:06:03.180 --> 00:06:08.640 They are still consistent and they are still  unbiased but the effect is the same as it was   00:06:08.640 --> 00:06:15.150 for the effect of estimating or in the case  when we estimated the mean from cluster data. 00:06:15.150 --> 00:06:24.720 So in effect interclass correlation decreases  our effective sample size. So if we have 100   00:06:24.720 --> 00:06:30.750 observations that are strongly clustered it's  possible that we actually have only 5 observations   00:06:30.750 --> 00:06:37.980 worth of information. In less extreme cases we  could have something like 100 observations but   00:06:37.980 --> 00:06:43.620 they actually give us information that is  only worth about 20 observations and so on. 
Things get more interesting if only X is clustered, or only the error term is clustered, but not the other. Let's first take a look at what happens when our X is clustered but the error terms are independent.

We can see that as the intraclass correlation increases, X becomes more and more clustered until we have just five values. In this case, when X is clustered but the error term is not, the clustering actually doesn't have an effect. The regression intercept and slope will be slightly different when the clustering changes, but that's just because when you estimate the same quantity from different samples you will get different results. There is no systematic effect of the estimates getting worse and worse when the intraclass correlation increases.

The reason for this is that regression analysis actually doesn't make any assumptions about the independent variable. Everything is estimated conditionally on the observed values. We could have a researcher that sets these X values; for example, in an experimental context we actually assign people to the treatment group and the control group, so those values are not random variables but something that we set as researchers. We could of course set them however we want, and the regression analysis would not be affected.

What if our X is random?
What if X is not clustered, but the error term is clustered? This would be quite an unusual case, but it's nevertheless useful to understand what happens.

When we cluster the error term, we effectively reduce the variation, or the number of unique values, in the error term, and it has one implication: the intercept is going to be estimated less precisely, but the slope estimate is going to stay about the same.

One way to understand why that is the case is that these error term values, even if we have just one value for each cluster, will still give us very useful information about the direction of the line, but not about how high the line is.

As you can see, when the errors are exactly the same for each cluster, so the intraclass correlation is 1, all of these clusters form exact lines that are parallel to the population regression line here, but the intercept is estimated less efficiently. This would of course be a very unusual scenario.

Typically, if you cannot assume that your error term, the unobserved sources of variation in the dependent variable, is independent across observations, then your explanatory variables cannot be assumed to be independent either.
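Both one-sided cases can be checked with a small simulation. This sketch is my illustration, not the video's code: 5 clusters of 20 observations, true intercept 0 and slope 1, "fully clustered" (ICC = 1) modeled as one draw per cluster, and x standardized so its realized variance is the same in every draw (mirroring how the video keeps the variance fixed while clustering).

```python
import numpy as np

rng = np.random.default_rng(2)

cl = np.repeat(np.arange(5), 20)          # cluster id for 100 observations

def slope_and_intercept_sd(cluster_x, cluster_e, reps=2000):
    """Sampling sd of the OLS slope and intercept (true model y = 0 + 1*x + e)
    when x and/or the error term is fully clustered (ICC = 1)."""
    slopes, intercepts = [], []
    for _ in range(reps):
        x = rng.normal(size=5)[cl] if cluster_x else rng.normal(size=100)
        x = (x - x.mean()) / x.std()       # fix the realized variance of x
        e = rng.normal(size=5)[cl] if cluster_e else rng.normal(size=100)
        y = x + e
        b, a = np.polyfit(x, y, 1)         # slope, intercept
        slopes.append(b)
        intercepts.append(a)
    return np.std(slopes), np.std(intercepts)

base = slope_and_intercept_sd(False, False)   # everything independent
x_cl = slope_and_intercept_sd(True, False)    # only x clustered
e_cl = slope_and_intercept_sd(False, True)    # only the error clustered
print(base, x_cl, e_cl)
# Clustering x alone leaves both sds essentially unchanged; clustering the
# error alone inflates the intercept sd while the slope sd stays put.
```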
So we either have the case where the error term is independent, which would be the case in random sampling, but X could be non-independent, for example due to manipulation; or we have the scenario where both of these variables correlate within clusters.

So why would this be a problem? Why is non-independence of observations a problem, and what does it cause? As we saw, non-independence of observations doesn't lead to bias, and it doesn't lead to inconsistency, but it leads to less precise estimates, and that is something that we just can't do anything about. If we don't have much information, then we can't estimate things precisely.

But that's not really a problem, because we can just state that we have an estimate but it's not very precise, and sometimes we have to just live with that.

The real problem is that the standard error formula is derived based on this variance formula, where we just plug in the estimated variance of the error term for the sigma, and we have the sum of squares total. This equation only depends on the variance of the error term, the variance of the predictor variable, and the sample size.
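For a one-regressor model, the conventional formula being referred to is se(b̂) = √(σ̂² / SSTₓ). The sketch below computes it and shows how it can be fooled: duplicating each observation 20 times shrinks the reported standard error even though no information is added. The duplication demonstration is my illustration, not the video's.

```python
import numpy as np

def iid_slope_se(x, y):
    """Conventional OLS slope standard error for one regressor:
    sqrt(sigma_hat^2 / SST_x), with sigma_hat^2 = RSS / (n - 2)."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    sigma2 = resid @ resid / (n - 2)           # estimated error variance
    sst_x = ((x - x.mean()) ** 2).sum()        # sum of squares of x
    return np.sqrt(sigma2 / sst_x)

rng = np.random.default_rng(3)
x5 = rng.normal(size=5)
y5 = x5 + rng.normal(size=5)

se_5 = iid_slope_se(x5, y5)
se_100 = iid_slope_se(np.tile(x5, 20), np.tile(y5, 20))
print(se_5, se_100)   # the duplicated sample reports a much smaller se,
                      # although it carries exactly the same information
```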
If we have a clustering effect in the data, we saw that the estimates will be less precise even if the variance of the error term, the variance of the predictor, and the sample size are the same. And this equation doesn't take the clustering into account. So regardless of whether we have five observations that are each replicated 20 times in our data, so that our effective sample size is 5 even though it appears to be 100, or we actually have 100 unique observations, this formula gives us the same result.

The outcome is that when you have clustering, the standard errors are generally estimated inconsistently, and they will be negatively biased. You will overstate the precision of the estimates, and that will cause incorrect inference; in particular, it can lead to false positive findings, rejecting the null hypothesis when in fact it should not be rejected.

So what can we do about this problem? There are a couple of strategies. One is to use a model that specifically includes terms that model the non-independence of the error term, which can be quite difficult to do if the pattern of dependency between observations is complex. Another approach is to use cluster-robust standard errors, which allow you to take an arbitrary correlation structure between observations into account; that is a very general strategy, and I will explain it in another video.
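A minimal sketch of cluster-robust standard errors (Liang-Zeger type, CR0, no small-sample correction) for the slope of a one-regressor model; the function name and the demonstration data are my own illustration, not from the video. The key step is summing the scores xᶜ·u within each cluster before squaring, which allows arbitrary within-cluster correlation.

```python
import numpy as np

def cluster_robust_slope_se(x, y, cluster):
    """CR0 cluster-robust standard error of the OLS slope, one regressor."""
    slope, intercept = np.polyfit(x, y, 1)
    u = y - (intercept + slope * x)        # residuals
    xc = x - x.mean()                      # centered regressor
    # Sum the score xc * u within each cluster, then square and add up.
    meat = sum(np.sum(xc[cluster == g] * u[cluster == g]) ** 2
               for g in np.unique(cluster))
    return np.sqrt(meat) / np.sum(xc ** 2)

# 5 unique observations, each replicated 20 times:
rng = np.random.default_rng(4)
x5 = rng.normal(size=5)
y5 = x5 + rng.normal(size=5)
x = np.tile(x5, 20)
y = np.tile(y5, 20)
cluster = np.tile(np.arange(5), 20)

se_clustered = cluster_robust_slope_se(x, y, cluster)
se_original = cluster_robust_slope_se(x5, y5, np.arange(5))
print(se_clustered, se_original)  # identical: replication adds no precision
```

Unlike the conventional formula, replicating the data does not shrink this standard error. In practice you would use a library implementation, for example statsmodels' `fit(cov_type="cluster", cov_kwds={"groups": ...})`, rather than hand-rolling it.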