WEBVTT
00:00:00.090 --> 00:00:05.190
Multicollinearity is another commonly
misunderstood feature of regression analysis.
00:00:05.190 --> 00:00:09.900
Multicollinearity refers to a scenario where
the independent variables are highly correlated.
00:00:09.900 --> 00:00:16.350
It is quite common to see studies do
diagnostics to detect multicollinearity.
00:00:16.350 --> 00:00:19.410
And then drop some variables from the model based
00:00:19.410 --> 00:00:25.200
on some statistics that indicate that
multicollinearity could be a problem.
00:00:25.200 --> 00:00:30.210
There are quite a lot of difficulties
or problems with that approach.
00:00:30.210 --> 00:00:36.780
Let's take a look at Hekman's paper.
So they identified that the customer race
00:00:36.780 --> 00:00:42.600
and customer gender were highly correlated
with physician race and physician gender.
00:00:42.600 --> 00:00:46.920
And therefore they decided to drop
customer gender and customer race
00:00:46.920 --> 00:00:51.360
from the data because that caused
the multicollinearity situation.
00:00:51.360 --> 00:00:54.750
This was because these variables were
correlated at more than 0.9.
00:00:54.750 --> 00:01:00.930
So what is this issue about, and why
would one want to drop variables?
00:01:00.930 --> 00:01:08.190
Multicollinearity relates to the
sampling variance of the OLS estimate.
00:01:08.190 --> 00:01:11.970
Or more generally, any estimator
that estimates a linear model.
00:01:11.970 --> 00:01:17.610
So to understand multicollinearity let's take
a look at the variance of the OLS estimates.
00:01:18.150 --> 00:01:22.440
The variance of the OLS estimates
is given by the equation Var(beta_j) = sigma^2 / (SST_j (1 - R_j^2)) shown
00:01:22.440 --> 00:01:26.730
here and this equation is also used
for estimating the standard errors.
00:01:28.350 --> 00:01:33.990
This equation tells us that the
variance of estimates depends on
00:01:33.990 --> 00:01:40.470
how well the other independent variables
explain the focal independent variable.
00:01:40.470 --> 00:01:43.920
Whose coefficient's variance we are interested in.
00:01:43.920 --> 00:01:54.630
So when this R square here goes up, then the
variance of the regression coefficient increases.
00:01:54.630 --> 00:01:59.820
The reason is that when this R
square approaches 1, then 1 - R square
00:01:59.820 --> 00:02:05.610
approaches 0, and when you multiply
something by something that approaches 0,
00:02:05.610 --> 00:02:10.500
the result of the multiplication
will approach 0 as well, and when you
00:02:10.500 --> 00:02:16.800
divide something by something that approaches
zero, then you will get a large number.
00:02:16.800 --> 00:02:23.130
So the variance increases when the focal
00:02:23.130 --> 00:02:26.670
variable is increasingly redundant in the model.
00:02:26.670 --> 00:02:29.670
It provides the same information
as the other variables.
00:02:29.670 --> 00:02:32.520
Then the standard error will increase.
00:02:33.510 --> 00:02:36.510
When our independent variables are more
00:02:36.510 --> 00:02:43.470
correlated, then the estimates will
be less efficient and less precise.
00:02:43.470 --> 00:02:46.470
And also the standard error will be larger because
00:02:46.470 --> 00:02:49.530
the standard error estimates the
precision of the estimates.
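This effect can be checked with a small simulation. The sketch below is an editor's illustration in Python, not part of the lecture; the coefficient values, noise level, and sample size are assumptions chosen to mirror the example discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_beta1(r12, n=100, reps=500):
    """Empirical dispersion of the OLS estimate of beta_1 over repeated samples."""
    cov = [[1.0, r12], [r12, 1.0]]
    estimates = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 0.25 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(size=n)
        Xd = np.column_stack([np.ones(n), X])            # add an intercept
        beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
        estimates.append(beta[1])
    return float(np.std(estimates))

# The dispersion of the beta_1 estimate grows as x1 and x2 become more correlated.
print(sd_beta1(0.0))   # nearly uncorrelated predictors
print(sd_beta1(0.9))   # r(x1, x2) = 0.9: clearly larger dispersion
```

With r = 0.9 the standard deviation of the estimate comes out roughly sqrt(1 / (1 - 0.81)) ≈ 2.3 times larger, which is exactly the inflation the lecture quantifies later with the variance inflation factor.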
00:02:49.530 --> 00:02:53.040
So is that the problem?
Well that depends.
00:02:53.040 --> 00:02:57.750
Let's take an example, what will
happen when we have two highly
00:02:57.750 --> 00:03:01.830
correlated variables and what it means
for the regression analysis results.
00:03:01.830 --> 00:03:05.220
So we should expect when two
variables are highly correlated
00:03:05.220 --> 00:03:08.010
the regression results to be very imprecise.
00:03:08.730 --> 00:03:13.530
That is, if we repeat the study over and
over many times, the dispersion of the
00:03:13.530 --> 00:03:16.200
estimates over multiple repeated samples is large.
00:03:18.120 --> 00:03:27.390
Here we have the correlation between x1 and x2 at
0.9, which is modeled based on Hekman's paper.
00:03:27.390 --> 00:03:35.370
Let's assume that the correlation between x1
and y varies between 0.43 and 0.52, so this
00:03:35.370 --> 00:03:41.250
is the variation, and this kind of dispersion
could easily be a result of a small sample.
00:03:41.250 --> 00:03:46.500
Let's assume that this
0.475 is the population value
00:03:46.500 --> 00:03:49.680
with a sample size of, for example, 100.
00:03:49.680 --> 00:03:53.610
It is very easy to get a sample correlation of 0.43.
00:03:53.610 --> 00:03:58.170
Then we have the correlation between x2 and y modeled
00:03:58.170 --> 00:04:04.110
the same way and we have five
combinations of correlations.
00:04:04.110 --> 00:04:06.150
These correlations vary a little.
00:04:06.150 --> 00:04:10.920
The correlation between x1 and y varies a little,
and the correlation between x2 and y varies a little.
00:04:10.920 --> 00:04:16.620
Because x1 and x2 are correlated when
we calculate the regression model using
00:04:16.620 --> 00:04:20.460
these correlations the regression
estimates actually vary widely.
00:04:20.460 --> 00:04:27.240
So in this model the regression
coefficient is -0.2, and here it's +0.7.
00:04:28.200 --> 00:04:30.060
Even the sign flips.
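These numbers can be reproduced directly from the correlations: for standardized variables, the regression coefficients solve R_xx * beta = r_xy. A quick editor's sketch in Python, using the correlations quoted above:

```python
import numpy as np

def std_betas(r12, r1y, r2y):
    """Standardized OLS coefficients computed from a correlation matrix."""
    Rxx = np.array([[1.0, r12], [r12, 1.0]])   # correlation among predictors
    rxy = np.array([r1y, r2y])                 # correlations with y
    return np.linalg.solve(Rxx, rxy)

# Population values: both effects are 0.25.
print(std_betas(0.9, 0.475, 0.475))   # approximately [0.25, 0.25]

# Small sampling shifts in the correlations with y swing the estimates wildly.
print(std_betas(0.9, 0.43, 0.52))     # approximately [-0.2, 0.7]
print(std_betas(0.9, 0.52, 0.43))     # approximately [0.7, -0.2]
```

Moving each correlation with y by less than 0.05 flips the sign of a coefficient, precisely because the 0.9 correlation between x1 and x2 makes the solution ill-conditioned.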
00:04:30.060 --> 00:04:35.490
Now the multicollinearity problem
relates to the fact that because
00:04:35.490 --> 00:04:43.260
x1 and x2 are so highly correlated that it is
very difficult to get the unique effect of x1.
00:04:43.260 --> 00:04:50.100
Because the changes in x1 are
always accompanied by changes in x2.
00:04:50.100 --> 00:04:56.100
So we don't know which one it is. Consider
company size: size in revenue
00:04:56.100 --> 00:05:01.080
and size in personnel are highly correlated,
maybe not at 0.9, but still highly correlated.
00:05:01.080 --> 00:05:06.150
So it's difficult to say whether, for example,
investment decisions depend more
00:05:06.150 --> 00:05:11.370
on the number of people or the revenues of
the company just based on statistical means.
00:05:11.370 --> 00:05:14.010
So what's the problem?
00:05:14.010 --> 00:05:23.220
The problem here is that if we want to say
that this effect of beta 1 is 0.25 and not 0
00:05:23.220 --> 00:05:27.000
we have to be able to differentiate
between these two correlations.
00:05:28.170 --> 00:05:33.540
And how large a sample, how
large a sample size would we require
00:05:33.540 --> 00:05:41.310
to say for sure that the correlation is
0.475 instead of 0.45 or 0.5?
00:05:41.310 --> 00:05:44.280
We have to understand the sampling
variation of a correlation.
00:05:44.280 --> 00:05:51.300
So the standard error, the standard
deviation, of a correlation of 0.475
00:05:51.300 --> 00:05:55.410
with a sample size of 100 is about 0.05.
00:05:55.410 --> 00:06:01.950
So if our sample size is 100 then we
can easily get something like 0.40
00:06:01.950 --> 00:06:07.020
or 0.5 too which are less than one
standard deviation from this mean.
00:06:07.740 --> 00:06:12.000
We can easily get these kinds
of correlations with a sample of 100.
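As a rough check of how much a sample correlation wobbles at n = 100, here is a small editor's simulation in Python; it assumes bivariate normal data, which the lecture does not specify:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.475, 100
cov = [[1.0, rho], [rho, 1.0]]

# Draw many samples of size 100 and record the sample correlation each time.
rs = np.array([
    np.corrcoef(rng.multivariate_normal([0.0, 0.0], cov, size=n), rowvar=False)[0, 1]
    for _ in range(5000)
])

print(rs.std())                                # dispersion of the sample correlation
print(((rs < 0.43) | (rs > 0.52)).mean())      # sample r frequently lands outside 0.43-0.52
```

The simulated dispersion lands in the 0.05 to 0.08 range, and the sample correlation falls outside the 0.43 to 0.52 window a substantial share of the time, supporting the point that n = 100 cannot separate these correlations.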
00:06:12.000 --> 00:06:17.070
So when our sample size is 100
and x1 and x2 are correlated at 0.9.
00:06:17.070 --> 00:06:23.310
We really cannot say which of
these sets of coefficients is the correct one.
00:06:23.310 --> 00:06:27.900
Because our sample size doesn't
allow us enough precision to say
00:06:27.900 --> 00:06:31.950
which of these correlations are
the true population correlations.
00:06:31.950 --> 00:06:35.880
That determine the population regression
coefficients that we are interested in.
00:06:35.880 --> 00:06:44.250
So the fact that these two variables
are highly correlated kind of amplifies the
00:06:44.250 --> 00:06:48.360
effect of sampling variation of this correlation.
00:06:48.360 --> 00:06:52.980
The sampling variation of the correlations
here is small, but because x1 and x2 are
00:06:52.980 --> 00:06:58.050
highly correlated that amplifies the
effect on these regression coefficients.
00:06:58.050 --> 00:07:09.060
To be sure that model 3 is actually correct, so that
00:07:10.080 --> 00:07:15.690
two standard deviations of the correlation wouldn't
be enough to get us from one model to another,
00:07:15.690 --> 00:07:18.030
We would need a sample size of 3000.
00:07:18.030 --> 00:07:23.310
So when variables are highly correlated,
that is referred to as multicollinearity.
00:07:24.000 --> 00:07:26.670
It refers to correlations
among the independent variables.
00:07:26.670 --> 00:07:29.100
It has nothing to do with the dependent variable.
00:07:29.100 --> 00:07:35.160
And it increases the sample size requirements
for us to estimate the effects.
00:07:35.160 --> 00:07:41.190
And this inflation of the
00:07:41.190 --> 00:07:46.170
variance estimates is quantified
by the variance inflation factor.
00:07:46.170 --> 00:07:54.870
What the variance inflation
factor quantifies is
00:07:54.870 --> 00:08:01.950
how much larger the variance of an estimate
is compared to a hypothetical scenario.
00:08:01.950 --> 00:08:06.570
Where the variable would be
uncorrelated with every other variable.
00:08:06.570 --> 00:08:12.660
So the variance inflation factor is
basically defined as 1 divided
00:08:12.660 --> 00:08:17.730
by 1 - R square of the focal variable
on all other independent variables.
00:08:18.810 --> 00:08:24.840
It's this 1 - R square part of the equation
here, so when that goes to 0,
00:08:24.840 --> 00:08:29.520
then variance inflation factor goes to infinity.
00:08:29.520 --> 00:08:34.680
When that is exactly 1 then variance
inflation factor is 1 which means
00:08:34.680 --> 00:08:38.010
that the multicollinearity is not present at all.
00:08:39.570 --> 00:08:44.430
There's a rule of thumb that many
people use, that the variance
00:08:44.430 --> 00:08:48.900
inflation factor should not exceed
10, and if it does, we have a problem.
00:08:48.900 --> 00:08:50.790
If it doesn't we don't have a problem.
00:08:50.790 --> 00:08:56.760
So in the previous slide I showed you
that a 0.9 correlation between two
00:08:56.760 --> 00:09:02.460
variables makes it very hard to say
which one of them has the actual effect.
00:09:02.460 --> 00:09:04.440
Because they covary so strongly together.
00:09:04.440 --> 00:09:10.080
So what is the variance inflation factor
when the correlation of x1 and x2 is 0.9?
00:09:10.080 --> 00:09:18.780
We can calculate the variance inflation
factor by taking the square of this correlation.
00:09:18.780 --> 00:09:24.390
So R square is the square of the correlation.
That's 0.9 to the second power,
00:09:24.390 --> 00:09:33.270
and then we just plug in the numbers, do some math,
and we get a variance inflation factor of 5.26.
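That arithmetic is easy to verify. A tiny editor's sketch in Python for the two-predictor case, where the R squared of one predictor on the other is simply the squared correlation:

```python
def vif(r):
    """Variance inflation factor for two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)

print(round(vif(0.9), 2))    # 5.26: high collinearity, yet well under the cutoff of 10
print(vif(0.0))              # 1.0: no multicollinearity at all
print(round(vif(0.99), 2))   # the factor explodes as r approaches 1
```

Note that the severe 0.9 correlation from the example produces a VIF of only 5.26, which is the point made next: the rule-of-thumb cutoff of 10 would not flag it.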
00:09:33.270 --> 00:09:39.330
So in the previous example we would have
needed 3000 observations to say for sure that
00:09:39.330 --> 00:09:43.320
model 3 was the correct model.
And not model 2 or model 4.
00:09:43.320 --> 00:09:49.350
But the variance inflation factor wouldn't have
detected that we had a multicollinearity issue.
00:09:49.350 --> 00:09:53.340
So what does that say about this rule?
00:09:53.340 --> 00:09:56.370
It is not a very useful rule.
00:09:56.370 --> 00:10:00.930
Ketokivi and Guide make a
good point about this rule,
00:10:00.930 --> 00:10:04.740
and about rules in general, in a Journal
of Operations Management editorial.
00:10:04.740 --> 00:10:06.720
So this is from 2015.
00:10:06.720 --> 00:10:12.000
When Ketokivi and Guide took over the Journal of
Operations Management as editors-in-chief,
00:10:12.000 --> 00:10:18.930
they first published an editorial on what
the methodological standards for this journal are.
00:10:18.930 --> 00:10:20.820
And they identified some problems.
00:10:20.820 --> 00:10:24.720
And they also identified places for improvements.
00:10:24.720 --> 00:10:28.530
So what you should not do and
what you should do. And they
00:10:28.530 --> 00:10:32.520
emphasized that you always have to
contextualize all your statistics.
00:10:32.520 --> 00:10:39.030
Like when you say the regression
coefficient is 0.2 whether it's a
00:10:39.030 --> 00:10:42.000
large effect or not depends on
the scales of both variables.
00:10:42.000 --> 00:10:43.980
And it also depends on the context.
00:10:43.980 --> 00:10:46.560
If you get a thousand more per year for
00:10:46.560 --> 00:10:51.390
each additional year of education
that's a big effect for somebody.
00:10:51.390 --> 00:10:56.340
And it's a small effect for another person
depending on where the person lives,
00:10:56.340 --> 00:10:58.290
and how much the person makes.
00:10:58.290 --> 00:11:01.860
So for all of these statistics, the interpretation
00:11:01.860 --> 00:11:06.630
requires context and they take aim at
the variance inflation factor as well.
00:11:06.630 --> 00:11:11.100
So the variance inflation factor quantifies
how much larger the variance would be
00:11:11.100 --> 00:11:16.710
compared to if there was no multicollinearity
whatsoever between the independent variables.
00:11:16.710 --> 00:11:24.990
And they say that if the standard errors from
your analysis are small, then who cares that they
00:11:24.990 --> 00:11:32.160
could be smaller if your independent
variables were completely uncorrelated.
00:11:32.160 --> 00:11:34.740
Which is an unrealistic scenario anyway.
00:11:34.740 --> 00:11:40.950
So if the standard errors indicate that
the estimates are precise, then who
00:11:40.950 --> 00:11:43.080
cares? They are precise, and that's what we care about.
00:11:43.080 --> 00:11:47.340
So the variance inflation factor doesn't
really tell us anything useful.
00:11:47.340 --> 00:11:51.810
On the other hand they also say that
in some scenarios the rule of thumb
00:11:51.810 --> 00:11:56.790
that variance inflation factor
must not exceed 10 is not enough.
00:11:56.790 --> 00:12:03.150
So in the previous example we saw that
there was a 0.9 correlation, corresponding
00:12:03.150 --> 00:12:11.430
to a variance inflation factor of 5.26,
which still made it a lot more
00:12:11.430 --> 00:12:15.780
difficult for us to identify which
one of those models was correct.
00:12:15.780 --> 00:12:19.830
So we had a collinearity issue, and it wasn't
detected by the variance inflation factor.
00:12:19.830 --> 00:12:26.940
So regarding the variance inflation factor,
as Ketokivi and Guide say, stating
00:12:26.940 --> 00:12:32.340
that it must not exceed a cut-off without
considering the context is nonsense.
00:12:32.340 --> 00:12:38.490
So that's what they say and I
agree with that statement fully.
00:12:39.330 --> 00:12:44.310
You have to always contextualize what does
a statistic mean in your particular study.
00:12:44.310 --> 00:12:51.030
Wooldridge also takes some shots at the variance
inflation factor and multicollinearity.
00:12:51.030 --> 00:12:57.300
So this is from the fourth edition of his
introductory econometrics textbook, and he didn't address
00:12:57.300 --> 00:13:02.160
multicollinearity in the first three
editions of his book because he thinks
00:13:02.160 --> 00:13:05.730
that it is not a useful concept
or it's not important enough.
00:13:05.730 --> 00:13:10.320
Regression analysis does not make any
assumptions about multicollinearity,
00:13:10.320 --> 00:13:15.630
it makes the assumption that each independent
variable should contribute unique information.
00:13:15.630 --> 00:13:22.840
So the variables cannot be perfectly correlated,
but it doesn't make any assumptions beyond that.
00:13:22.840 --> 00:13:26.920
He decided that he's going to take up this
00:13:26.920 --> 00:13:30.490
issue because there's so much bad
advice about multicollinearity.
00:13:31.930 --> 00:13:39.580
He says that these explanations of
multicollinearity are typically wrongheaded.
00:13:39.580 --> 00:13:42.280
People explain that it is a problem.
00:13:42.280 --> 00:13:45.310
And then if you have variance inflation factor
00:13:45.310 --> 00:13:50.680
more than 10 you have to drop variables
without really explaining the problem.
00:13:50.680 --> 00:13:55.510
Or what the consequences of
dropping variables from your model are.
00:13:55.510 --> 00:14:02.110
So let's now take a look at what
it means to solve a multicollinearity problem.
00:14:02.110 --> 00:14:08.260
So, to understand the multicollinearity
problem: multicollinearity is a problem
00:14:08.260 --> 00:14:14.620
in the same sense that a fever is a
disease. It is not really a problem per se,
00:14:14.620 --> 00:14:19.570
it is a symptom and you don't treat
the symptom you treat the disease.
00:14:19.570 --> 00:14:25.300
So if you have a child who has
fever, typically cooling down the
00:14:25.300 --> 00:14:29.050
child by putting them outside in the cold
is not the right treatment.
00:14:29.050 --> 00:14:31.330
You have to look at the
cause of the multicollinearity, the
00:14:31.330 --> 00:14:35.980
cause of the fever and fix the cause
instead of trying to fix the symptom.
00:14:35.980 --> 00:14:40.240
The typical solution for
multicollinearity problems,
00:14:40.240 --> 00:14:43.210
how do we make x1 and x2 less correlated?
00:14:43.210 --> 00:14:45.220
Well we just drop one from the model.
00:14:45.220 --> 00:14:52.930
So let's say we drop
x2 from the model.
00:14:52.930 --> 00:14:58.300
In the previous example, the correct model
was that both effects were 0.25.
00:14:58.300 --> 00:15:10.990
And now if we drop x2, then the estimate of x1
will reflect the influence of both x1 and x2.
00:15:10.990 --> 00:15:15.160
So what will happen is that we will overestimate the
00:15:15.160 --> 00:15:20.050
regression coefficient beta 1 by
90%, and the standard errors are smaller.
00:15:20.050 --> 00:15:28.570
So we will have a false sense of accuracy
related to this severely biased estimate.
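The 90% figure follows from the omitted-variable-bias formula: with standardized variables, the estimate of beta 1 converges to beta_1 + beta_2 * r12 when x2 is left out. A quick editor's check in Python using the example's values:

```python
beta1, beta2, r12 = 0.25, 0.25, 0.9

# When x2 is omitted, x1 picks up x2's effect
# through the 0.9 correlation between x1 and x2.
biased = beta1 + beta2 * r12
print(biased)                        # 0.475 instead of the true 0.25
print((biased - beta1) / beta1)      # relative overestimate: about 0.9, i.e. 90%
```

So "solving" the collinearity by dropping a variable trades an imprecise but unbiased estimate for a precise-looking but badly biased one.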
00:15:28.570 --> 00:15:34.720
And also if you have control variables
that are collinear with one another,
00:15:34.720 --> 00:15:38.290
that is irrelevant because typically we just want
00:15:38.290 --> 00:15:42.760
to know how much of the variation of the
dependent variable is explained jointly,
00:15:42.760 --> 00:15:45.580
by those controls; we're
not really interested in
00:15:45.580 --> 00:15:49.030
which one of the controls actually
explains the dependent variable.
00:15:50.680 --> 00:15:54.550
Collinearity between
00:15:54.550 --> 00:15:57.190
the interesting variables and
the controls is important.
00:15:57.190 --> 00:16:01.090
But if the collinearity is just among the
controls, then it doesn't matter.
00:16:01.090 --> 00:16:09.220
Okay so treating collinearity as a problem is
the same thing as treating fever as a disease.
00:16:09.220 --> 00:16:11.410
So it's not a smart thing to do.
00:16:11.410 --> 00:16:16.780
We have to understand what are the reasons
why two variables are so highly correlated
00:16:16.780 --> 00:16:21.520
that we can't really say which one is
the cause of the dependent variable.
00:16:21.520 --> 00:16:25.240
So there are a couple of
reasons why that could happen.
00:16:25.990 --> 00:16:28.930
Multicollinearity could be happening because you
00:16:28.930 --> 00:16:31.510
have mindlessly added a lot
of variables into the model.
00:16:31.510 --> 00:16:35.710
And you shouldn't be adding
variables to the model mindlessly.
00:16:35.710 --> 00:16:40.030
All variables that go into your
model must be based on theory.
00:16:40.030 --> 00:16:44.860
So just throwing a hundred variables into a
model typically doesn't make sense.
00:16:44.860 --> 00:16:50.320
Your models are built to test theory,
and so they must be driven by theory.
00:16:50.320 --> 00:16:56.530
So what you think has a causal
effect on the Y variable must
00:16:56.530 --> 00:17:00.580
go into the model, and you also must
be able to explain why: what's the
00:17:00.580 --> 00:17:05.620
mechanism by which the independent variable
influences the dependent variable causally.
00:17:05.620 --> 00:17:08.050
So that is one case.
You have just been
00:17:08.050 --> 00:17:13.840
mindlessly data mining, and that's a problem.
So multicollinearity is not the problem here,
00:17:13.840 --> 00:17:17.650
the problem is that you're making
stupid modeling decisions.
00:17:17.650 --> 00:17:22.990
The second problem is that you have distinct
constructs but their measures are highly
00:17:22.990 --> 00:17:31.120
correlated, and here the primary problem is not
multicollinearity but discriminant validity.
00:17:31.120 --> 00:17:38.020
So if two measures of things that are
supposed to be distinct are highly
00:17:38.020 --> 00:17:41.980
correlated, it's a problem of measurement validity.
00:17:41.980 --> 00:17:44.170
I'll address that in a later video.
00:17:44.170 --> 00:17:47.740
Then there is the case where you have two
measures of the same construct in the model.
00:17:47.740 --> 00:17:53.080
For example, if you are studying the
effect of company size, then you have
00:17:53.080 --> 00:17:56.890
revenue and personnel both as
measures of size in the model.
00:17:56.890 --> 00:18:01.390
That's not a good idea to have two
measures of the same thing in the model.
00:18:01.390 --> 00:18:06.460
Let's take an extreme example, let's
assume that we want to study the effect
00:18:06.460 --> 00:18:10.570
of person's height on person's weight
and we have two measures of height.
00:18:10.570 --> 00:18:12.250
We have centimeters and inches.
00:18:12.250 --> 00:18:18.670
It doesn't make any sense to try to
get the effect of inches independent
00:18:18.670 --> 00:18:21.400
of the effect of centimeters; in fact,
that can't even be estimated.
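This is easy to see in matrix terms: with height in both centimeters and inches, the design matrix loses a rank, so the normal equations have no unique solution. A small editor's sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(2)
cm = rng.normal(170.0, 10.0, size=50)   # heights in centimeters
inches = cm / 2.54                      # the same heights, perfectly collinear

# Design matrix with intercept, cm, and inches: rank 2, not 3,
# so separate effects for cm and inches cannot be estimated.
X = np.column_stack([np.ones(50), cm, inches])
print(np.linalg.matrix_rank(X))
```

This is the extreme end of the same continuum: at correlation 1.0 the unique effects are not merely imprecise, they are undefined.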
00:18:21.400 --> 00:18:29.020
So if you have multiple
measures of the same thing, then typically
00:18:29.770 --> 00:18:36.820
you should first combine those multiple
measures into a single composite measure.
00:18:36.820 --> 00:18:37.870
I'll cover that later on.
00:18:37.870 --> 00:18:44.020
Then the final case is that you are really
interested in two closely related constructs.
00:18:44.020 --> 00:18:45.370
And their distinct effects.
00:18:46.510 --> 00:18:48.970
For example you want to know whether a person's
00:18:48.970 --> 00:18:55.360
age or a person's tenure influences
the customer satisfaction scores
00:18:55.360 --> 00:18:59.620
That the patients give to the
doctors, like in Hekman's study.
00:18:59.620 --> 00:19:03.640
Then you really cannot drop either one of those.
00:19:03.640 --> 00:19:09.400
You can't say that, because tenure
and age are highly correlated,
00:19:09.400 --> 00:19:15.430
we are just going to omit tenure and
assume that all the correlation between
00:19:15.430 --> 00:19:21.970
age and customer satisfaction is due to the
age only and tenure doesn't have an effect.
00:19:21.970 --> 00:19:27.820
So that is not the right choice.
Instead you have to just increase the sample size.
00:19:27.820 --> 00:19:34.480
So that you can answer your complex
research question in a precise manner.