WEBVTT Kind: captions Language: en 00:00:00.060 --> 00:00:06.750 Normal regression analysis can be used to estimate models with multi-level data. While 00:00:06.750 --> 00:00:11.340 normal regression analysis is not always the ideal technique for doing so, there are 00:00:11.340 --> 00:00:16.350 a couple of simple strategies that can be applied to estimate models with such data. 00:00:16.350 --> 00:00:21.240 These techniques provide a starting point for understanding more complex 00:00:21.240 --> 00:00:25.590 analysis techniques. Let's take a look at how OLS regression can be 00:00:25.590 --> 00:00:28.830 applied to estimate the within effect from multi-level data. 00:00:28.830 --> 00:00:34.260 Our example data is these 15 companies observed over 10 years that we have 00:00:34.260 --> 00:00:39.720 looked at in a previous video, and we can see that the within effect and the between 00:00:39.720 --> 00:00:46.320 effect are not the same. So these companies that have invested only a little in R&D are 00:00:46.320 --> 00:00:49.860 less profitable than these companies that are investing heavily in R&D. 00:00:49.860 --> 00:00:57.270 Nevertheless, the within effect in a company is negative. So when a company increases its 00:00:57.270 --> 00:01:03.120 R&D investment, such as this company here, then profitability will go down. So the between effect 00:01:03.120 --> 00:01:09.180 is positive and the within effect is negative, and we want to understand how to estimate the within 00:01:09.180 --> 00:01:14.730 effect from this data. So we want to take these two effects apart and estimate the within effect. 00:01:14.730 --> 00:01:20.190 The within effect would be important, for example, for informing policy on a firm 00:01:20.190 --> 00:01:26.220 level. Should a firm increase or decrease its R&D investments if it cares about its 00:01:26.220 --> 00:01:31.290 profitability?
That is the question that the within effect could answer in this case. 00:01:31.290 --> 00:01:36.210 We could of course estimate a separate regression model for each company. So we 00:01:36.210 --> 00:01:41.760 have 15 companies: let's split the data into 15 subsamples and run a regression analysis on 00:01:41.760 --> 00:01:46.590 each company, which I have done here and shown the lines graphically. But the problem is that 00:01:46.590 --> 00:01:51.510 then we only have ten observations for each company, which is a very small number, and also 00:01:51.510 --> 00:01:57.780 we get 15 different regression coefficients, and typically we just want to report one. 00:01:57.780 --> 00:02:04.830 So how do we get the within effect? There are two very easy strategies for doing so. 00:02:04.830 --> 00:02:09.990 The first strategy is to use dummy variables. So the idea 00:02:09.990 --> 00:02:15.120 of a dummy variable is that if we have 15 companies then we create 15 00:02:15.120 --> 00:02:20.790 variables. And those variables indicate which company that observation is for. 00:02:20.790 --> 00:02:28.230 So we originally have the variable firm, which takes 15 different values, and then we create 00:02:28.230 --> 00:02:35.790 15 new variables, from firm 1 to firm 15, and the first variable, firm 1, receives a value of 00:02:35.790 --> 00:02:42.120 1 for the first firm and a value of 0 for the other firms. The second dummy variable 00:02:42.120 --> 00:02:49.080 receives a value of 1 for the second firm and 0 otherwise. So these dummy variables indicate 00:02:49.590 --> 00:02:58.950 to which firm an observation belongs. And the dummy variables are defined in a way that just one 00:02:58.950 --> 00:03:05.160 variable at a time receives a one for an observation and all the others are zeros here. 00:03:05.160 --> 00:03:11.070 So this indicates that this observation belongs to firm one and not any other firm.
So all these 00:03:11.070 --> 00:03:15.570 are zeros. And how do we apply these in a regression analysis and what's the outcome? 00:03:15.570 --> 00:03:23.130 When we add all the dummies to a regression model, then typically 00:03:23.130 --> 00:03:28.170 your regression software will drop one from the model. So here firm 1 00:03:28.170 --> 00:03:35.070 has been omitted. The reason is that if you add all of 00:03:35.070 --> 00:03:40.920 those dummies to the model, they will be perfectly collinear with the intercept. 00:03:40.920 --> 00:03:47.850 So in practice we typically omit the first dummy. So we only have firm 2 to firm 15 00:03:47.850 --> 00:03:54.720 dummies and then firm 1 is the reference category. So the idea is that the 00:03:54.720 --> 00:04:03.240 profitability of firm 1 when R&D is 0 is given by the intercept, and then the firm 2 dummy gives the 00:04:03.240 --> 00:04:11.100 average difference between firm 1 and firm 2 when R&D investments are held constant. 00:04:11.100 --> 00:04:19.260 So these dummies don't indicate any absolute levels but they indicate the difference between 00:04:19.260 --> 00:04:25.470 the focal firm, firm 2 for example, and the reference category, firm 1. 00:04:25.470 --> 00:04:30.450 Quite often we wouldn't interpret these dummies because there are quite a few of 00:04:30.450 --> 00:04:35.880 them and typically we are not interested in specific cases. We're interested in how the 00:04:35.880 --> 00:04:43.230 regression line goes, controlling for the fact that we have data from multiple different companies. 00:04:43.230 --> 00:04:48.300 So this is the first strategy. We estimate the dummy variable model, so we basically allow 00:04:48.300 --> 00:04:55.350 each company to have a specific intercept that is estimated from the data, and then 00:04:55.350 --> 00:04:59.850 these companies have regression lines with the same slope.
So each company 00:04:59.850 --> 00:05:06.510 basically receives the same regression line except that the intercept can be different. 00:05:06.510 --> 00:05:13.920 So that is one easy strategy to model the differences between these companies. 00:05:13.920 --> 00:05:21.300 The second strategy is within-firm centering. In this strategy we don't 00:05:21.300 --> 00:05:27.240 model the constant differences or stable differences between companies. Instead we 00:05:27.240 --> 00:05:32.730 eliminate the differences between the firms or companies before the actual regression analysis. 00:05:32.730 --> 00:05:40.110 So what we do is that we take R&D, the explanatory variable, and profitability, 00:05:40.110 --> 00:05:46.590 the dependent variable, and we calculate the cluster mean of both of these variables. So 00:05:46.590 --> 00:05:53.160 we have R&Dm, which stands for R&D mean, and we calculate the mean R&D. For the first company 00:05:53.160 --> 00:06:02.820 it's 18%, then we calculate the mean R&D for the second company and it's 6.4%, and so on. 00:06:02.820 --> 00:06:10.950 Then we center the R&D by subtracting the cluster mean from 00:06:10.950 --> 00:06:15.990 the original value. So this centered R&D, R&Dc, 00:06:15.990 --> 00:06:22.470 is how much that observation differs from the mean value of the company. 00:06:22.470 --> 00:06:32.130 So all these R&Dc values sum to 0 within a company. We do the same for the profitability. So we have 00:06:32.130 --> 00:06:39.210 the mean profitability and then the mean-centered profitability. This eliminates any 00:06:39.210 --> 00:06:44.730 systematic differences between companies because after the within-firm centering 00:06:44.730 --> 00:06:52.020 all variables have a mean of zero within a firm. So the between-firm differences 00:06:52.020 --> 00:06:58.320 disappear from the data.
Then we run a regression analysis and we just use the 00:06:58.320 --> 00:07:03.390 mean-centered dependent variable and the mean-centered independent variable, and we 00:07:03.390 --> 00:07:07.920 get the same regression estimate as before, which is the within effect. 00:07:07.920 --> 00:07:13.350 So this is a regression analysis where all between effects and all contextual 00:07:13.350 --> 00:07:18.540 effects have been eliminated from the data. What remains is the within effect, which is estimated. 00:07:18.540 --> 00:07:25.560 Let's compare the three models. First we have a model that ignores clustering. We 00:07:25.560 --> 00:07:31.230 just run a normal regression analysis of profitability on R&D. Then we have 00:07:31.230 --> 00:07:34.620 the dummy variable model and then we have the within-firm centering model. 00:07:34.620 --> 00:07:41.160 We can see that the coefficients here for the dummy variable model and for the centering 00:07:41.160 --> 00:07:49.050 model are the same: minus 0.418, and this is the within effect. So both of these 00:07:49.050 --> 00:07:55.890 techniques produce the exact same estimate, and that is the estimate of the within effect. 00:07:55.890 --> 00:08:01.200 Then if we ignore clustering we get the population average effect. So 00:08:01.200 --> 00:08:05.370 the population average effect just gives us the regression coefficient ignoring 00:08:05.370 --> 00:08:10.020 clustering, and it's very difficult to give any causal interpretation to that effect. 00:08:10.020 --> 00:08:15.420 The within effect has a causal interpretation: how much can 00:08:15.420 --> 00:08:20.070 we expect the profitability of one firm to increase if that firm increases its 00:08:20.070 --> 00:08:26.040 R&D investments by one unit. But there are some interesting features when we compare 00:08:26.040 --> 00:08:30.180 the dummy variable model and particularly the within-firm centering model.
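The comparison of the three models can be tried out on simulated data. The following is only a sketch under stated assumptions: the data are randomly generated (not the 15 companies from the video), the variable names are hypothetical, and NumPy's least-squares solver stands in for whatever regression software was used in the lecture. It fits the model that ignores clustering, the dummy variable model with firm 1 omitted, and the within-firm centering model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_firms, n_years = 15, 10

# Hypothetical data: firms with higher average R&D are more profitable
# (positive between effect), but within a firm more R&D lowers profit.
firm = np.repeat(np.arange(n_firms), n_years)
firm_mean_rd = rng.uniform(2, 20, n_firms)
rd = firm_mean_rd[firm] + rng.normal(0, 1.5, firm.size)
profit = firm_mean_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)


def ols(X, y):
    """Return OLS coefficients for design matrix X and outcome y."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def group_mean(v, g):
    """Mean of v within each group, repeated for every observation."""
    return (np.bincount(g, weights=v) / np.bincount(g))[g]


ones = np.ones(firm.size)

# 1) Ignore clustering: profit on R&D with a single intercept.
b_pooled = ols(np.column_stack([ones, rd]), profit)[1]

# 2) Dummy variables: intercept plus dummies for firms 2..15 (firm 1 omitted).
dummies = (firm[:, None] == np.arange(1, n_firms)).astype(float)
b_dummy = ols(np.column_stack([ones, rd, dummies]), profit)[1]

# 3) Within-firm centering: subtract firm means from both variables.
rd_c = rd - group_mean(rd, firm)
profit_c = profit - group_mean(profit, firm)
b_within = ols(rd_c[:, None], profit_c)[0]

print(f"pooled: {b_pooled:.3f}  dummy: {b_dummy:.3f}  centered: {b_within:.3f}")
```

In this sketch the dummy variable slope and the centered slope agree up to floating-point error, while the slope that ignores clustering mixes the within and between effects.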
00:08:30.180 --> 00:08:35.220 The first is that the R-square values are quite different. So for the first model 00:08:35.220 --> 00:08:42.750 it is 31%, for the second model 70%, and for the third model 20%. So why such a large difference? 00:08:42.750 --> 00:08:52.740 Well, this R-square here, the first one, roughly quantifies how much the within effect and between 00:08:52.740 --> 00:09:00.480 effect together explain the data. It doesn't really quantify that precisely, because 00:09:00.480 --> 00:09:05.340 if the within effect and between effect are not the same then estimating two different effects 00:09:05.340 --> 00:09:12.180 will give you a higher R-square. But roughly, it is how much R&D generally explains profitability. 00:09:12.180 --> 00:09:17.970 Then we have the 70% here in the dummy variable model. So what is this 00:09:17.970 --> 00:09:26.310 70% R-square? It quantifies how much the unobserved heterogeneity term, how 00:09:26.310 --> 00:09:30.330 much the contextual effect, and how much the within effect together explain the data. 00:09:30.330 --> 00:09:36.810 So if we eliminate all those three sources of variance in the data, there is still 30% 00:09:36.810 --> 00:09:42.510 of the variation that is unexplained. Then the within-firm centering gives 00:09:42.510 --> 00:09:53.610 us a 20% R-square, and this is roughly how much R&D explains of the within-firm variation. 00:09:53.610 --> 00:09:59.970 So if we want to understand how much R&D investment influences the variation of an 00:09:59.970 --> 00:10:06.420 individual company's performance, then this R-square of 20% would answer that question. 00:10:06.420 --> 00:10:12.990 So which one should you report?
You should really 00:10:12.990 --> 00:10:17.790 understand why these are different, but if you don't know which one you should report, 00:10:17.790 --> 00:10:23.550 typically the within-firm centering R-square is the most useful, 00:10:23.550 --> 00:10:29.160 because it has a clear interpretation as the R-square of a particular effect: how much R&D 00:10:29.160 --> 00:10:35.310 influences variation of company performance within that firm, whereas the dummy variable 00:10:35.310 --> 00:10:42.000 and ignore-clustering R-squares combine explanation on at least two different levels. 00:10:42.000 --> 00:10:48.540 Then there is another interesting feature. While the estimates from the 00:10:48.540 --> 00:10:55.050 dummy variable model and within-firm centering are exactly the same, the standard errors are not 00:10:55.050 --> 00:11:03.720 the same. So what does that mean? The standard error quantifies how much we expect the coefficient to 00:11:03.720 --> 00:11:10.770 vary if we repeat the same analysis over and over on repeated samples from the same population. 00:11:10.770 --> 00:11:16.080 The dummy variable model and the within-firm centering model have been proven to produce 00:11:16.080 --> 00:11:20.670 the same results. So their variation, the real variation from one sample to another, 00:11:20.670 --> 00:11:27.600 is exactly the same. So how can the standard errors be different? 00:11:27.600 --> 00:11:33.420 If the variation of this dummy variable coefficient and this within-firm coefficient 00:11:33.420 --> 00:11:39.540 is actually the same, then one of these standard errors must be incorrect, because 00:11:39.540 --> 00:11:46.890 they both quantify the same variation in the hypothetical scenario of repeated analysis.
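The mismatch between the two standard errors can be reproduced on simulated data. This is a sketch under stated assumptions: the data are randomly generated, all names are hypothetical, and the standard errors are the conventional OLS ones computed by hand with NumPy. Both models leave the same residuals, but plain OLS on centered data divides the residual sum of squares by N - 1 degrees of freedom, while the dummy variable model divides by N - 15 - 1 because it also counts the 15 firm intercepts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_firms, n_years = 15, 10
n = n_firms * n_years

# Hypothetical panel: 15 firms observed over 10 years.
firm = np.repeat(np.arange(n_firms), n_years)
firm_mean_rd = rng.uniform(2, 20, n_firms)
rd = firm_mean_rd[firm] + rng.normal(0, 1.5, n)
profit = firm_mean_rd[firm] - 0.4 * rd + rng.normal(0, 1, n)


def ols_se(X, y):
    """OLS coefficients and conventional (non-robust) standard errors."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]        # observations minus parameters
    sigma2 = resid @ resid / dof         # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se


def center(v, g):
    """Subtract the group mean of v from every observation."""
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]


# Dummy variable model: intercept, R&D, dummies for firms 2..15.
dummies = (firm[:, None] == np.arange(1, n_firms)).astype(float)
X_dummy = np.column_stack([np.ones(n), rd, dummies])
b_dummy, se_dummy = ols_se(X_dummy, profit)

# Naive OLS on within-firm centered variables (no intercept needed).
b_naive, se_naive = ols_se(center(rd, firm)[:, None], center(profit, firm))

# Rescale the naive standard error to the dummy model's degrees of freedom.
se_corrected = se_naive[0] * np.sqrt((n - 1) / (n - n_firms - 1))

print(f"dummy SE: {se_dummy[1]:.4f}  naive centered SE: {se_naive[0]:.4f}  "
      f"corrected: {se_corrected:.4f}")
```

In this sketch the naive centered standard error comes out smaller than the dummy variable one, and the degrees-of-freedom rescaling brings the two into agreement.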
00:11:46.890 --> 00:11:52.740 It turns out that the within-firm centering standard error is actually 00:11:52.740 --> 00:12:00.480 biased and inconsistent. It underestimates the variability of the regression coefficient. 00:12:00.480 --> 00:12:07.320 The reason is that when we within-firm center, we also take out some variation of the error 00:12:07.320 --> 00:12:11.100 term, and the variation of the error term is used to estimate the standard error. 00:12:11.100 --> 00:12:16.080 So the within-firm centering strategy should actually never be applied in 00:12:16.080 --> 00:12:20.040 practice to the dependent variable because the standard errors will 00:12:20.040 --> 00:12:24.060 be inconsistent. If you do so, you have to apply a correction. 00:12:24.060 --> 00:12:29.940 There are analysis techniques such as generalized least squares that 00:12:29.940 --> 00:12:34.800 do this kind of centering, but those techniques also apply the correction 00:12:34.800 --> 00:12:38.100 to the standard errors. So if you want to center the dependent variable, 00:12:38.100 --> 00:12:43.770 you should always do so by using one of the canned procedures of your statistical software. 00:12:43.770 --> 00:12:50.670 So these are two simple strategies. Well, there is a third simple strategy: run a separate 00:12:50.670 --> 00:12:57.060 regression analysis for each company, but that has the problem that you have a large number 00:12:57.060 --> 00:13:04.080 of models with very small sample sizes each, and how would you aggregate the results 00:13:04.080 --> 00:13:08.040 for interpretation? So this is typically not something that people would consider. 00:13:08.040 --> 00:13:13.170 The dummy variable regression is actually a useful technique if you have a small number of 00:13:13.170 --> 00:13:19.470 cases.
The problem with that is that the R-square is difficult to interpret, and the centering technique 00:13:19.470 --> 00:13:24.420 is something that you should not use; at least you should never center the dependent variable. 00:13:24.420 --> 00:13:31.680 So how should you actually model this data? The dummy variables are okay but there are also other 00:13:31.680 --> 00:13:36.990 techniques. So the more advanced techniques for multi-level modeling, and these are actually more 00:13:36.990 --> 00:13:41.400 commonly used techniques for multi-level data than the normal regression analysis, 00:13:41.400 --> 00:13:48.090 can be categorized based on one assumption. If you can assume that 00:13:48.090 --> 00:13:54.660 there are no contextual effects of the variables of interest, which econometricians call the random 00:13:54.660 --> 00:13:59.760 effects assumption (I have another video about that assumption), then you can apply some 00:13:59.760 --> 00:14:04.950 of these techniques. You can apply generalized least squares random effects estimation, maximum 00:14:04.950 --> 00:14:10.920 likelihood estimation of random intercept models, or you can apply the generalized estimating equations 00:14:10.920 --> 00:14:15.840 technique, or you can apply normal regression analysis with cluster-robust standard errors. 00:14:15.840 --> 00:14:24.690 If you cannot assume that the contextual effects are zero, if you know or have an idea that 00:14:24.690 --> 00:14:30.570 they may be nonzero, then you can use generalized least squares fixed effects regression analysis, 00:14:30.570 --> 00:14:38.490 or alternatively you can use any of these analysis techniques and then use cluster 00:14:38.490 --> 00:14:45.810 means of the interesting variables as controls. Recall that cluster means are the means of 00:14:45.810 --> 00:14:52.020 the variables within clusters that you calculate when you do the cluster mean centering procedure.
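The last option, adding cluster means as controls, can be illustrated with normal regression analysis. This is a minimal sketch, assuming simulated data and hypothetical variable names; it uses plain OLS through NumPy and does not compute the cluster-robust standard errors that a real analysis would also need. The firm mean of R&D enters as an extra predictor, and the coefficient on R&D itself then matches the within estimate.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_firms, n_years = 15, 10

# Hypothetical panel with a positive between and a negative within effect.
firm = np.repeat(np.arange(n_firms), n_years)
firm_mean_rd = rng.uniform(2, 20, n_firms)
rd = firm_mean_rd[firm] + rng.normal(0, 1.5, firm.size)
profit = firm_mean_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)


def group_mean(v, g):
    """Mean of v within each group, repeated for every observation."""
    return (np.bincount(g, weights=v) / np.bincount(g))[g]


# Cluster mean of R&D (the R&Dm variable from the centering procedure).
rd_m = group_mean(rd, firm)

# Normal regression of profit on R&D with the cluster mean as a control.
X = np.column_stack([np.ones(firm.size), rd, rd_m])
b = np.linalg.lstsq(X, profit, rcond=None)[0]
within_slope, contextual = b[1], b[2]

# Reference value: the within estimate from centered variables.
rd_c = rd - rd_m
profit_c = profit - group_mean(profit, firm)
b_within = (rd_c @ profit_c) / (rd_c @ rd_c)

print(f"R&D coefficient with cluster-mean control: {within_slope:.3f}")
print(f"within estimate from centering:            {b_within:.3f}")
print(f"coefficient on the cluster mean:           {contextual:.3f}")
```

In this sketch the coefficient on the cluster mean picks up the difference between the between and within effects, which is why controlling for it leaves the R&D coefficient equal to the within effect.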