WEBVTT
00:00:00.090 --> 00:00:05.190
Multicollinearity is another commonly
misunderstood feature of regression analysis.
00:00:05.190 --> 00:00:09.900
Multicollinearity refers to a scenario where
the independent variables are highly correlated.
00:00:09.900 --> 00:00:16.350
It is quite common to see studies do
diagnostics to detect multicollinearity.
00:00:16.350 --> 00:00:19.410
And then drop some variables from the model based
00:00:19.410 --> 00:00:25.200
on some statistics that indicate that
multicollinearity could be a problem.
00:00:25.200 --> 00:00:30.210
There are quite a lot of difficulties
or problems with that approach.
00:00:30.210 --> 00:00:36.780
Let's take a look at Hekman's paper.
So they identified that the customer race
00:00:36.780 --> 00:00:42.600
and customer gender were highly correlated
with physician race and physician gender.
00:00:42.600 --> 00:00:46.920
And therefore they decided to drop
customer gender and customer race
00:00:46.920 --> 00:00:51.360
from the data because that caused
the multicollinearity situation.
00:00:51.360 --> 00:00:54.750
This was because these variables were
correlated at more than 0.9.
00:00:54.750 --> 00:01:00.930
So what is this issue about, and why
would one want to drop variables?
00:01:00.930 --> 00:01:08.190
Multicollinearity relates to the
sampling variance of the OLS estimate.
00:01:08.190 --> 00:01:11.970
Or more generally, any estimator
that estimates a linear model.
00:01:11.970 --> 00:01:17.610
So to understand multicollinearity let's take
a look at the variance of the OLS estimates.
00:01:18.150 --> 00:01:22.440
The variance of the OLS estimates
is given by the equation Var(beta_j) = sigma^2 / (SST_j (1 - R_j^2)) shown
00:01:22.440 --> 00:01:26.730
here and this equation is also used
for estimating the standard errors.
00:01:28.350 --> 00:01:33.990
This equation tells us that the
variance of estimates depends on
00:01:33.990 --> 00:01:40.470
how well the other independent variables
explain the focal independent variable.
00:01:40.470 --> 00:01:43.920
Whose coefficient's variance we are interested in.
00:01:43.920 --> 00:01:54.630
So when this R square here goes up, then the
variance of the regression coefficient increases.
00:01:54.630 --> 00:01:59.820
The reason is that when this R
square approaches 1, then 1 - R square
00:01:59.820 --> 00:02:05.610
approaches 0, and when you multiply
something by something that approaches 0,
00:02:05.610 --> 00:02:10.500
the result of the multiplication
will approach 0 as well, and when you
00:02:10.500 --> 00:02:16.800
divide something by something that approaches
zero, then you will get a large number.
00:02:16.800 --> 00:02:23.130
So the variance increases when the focal
00:02:23.130 --> 00:02:26.670
variable is increasingly redundant in the model.
00:02:26.670 --> 00:02:29.670
It provides the same information
as the other variables.
00:02:29.670 --> 00:02:32.520
Then the standard error will increase.
00:02:33.510 --> 00:02:36.510
When our independent variables are more
00:02:36.510 --> 00:02:43.470
correlated, then the estimates will
be less efficient and less precise.
00:02:43.470 --> 00:02:46.470
And also the standard error will be larger because
00:02:46.470 --> 00:02:49.530
the standard error estimates the
precision of the estimates.
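This effect can be checked with a small simulation. The sketch below is an editor's illustration in Python, not part of the lecture; the coefficient values, noise level, and sample size are assumptions chosen to mirror the example discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_beta1(r12, n=100, reps=500):
    """Empirical dispersion of the OLS estimate of beta_1 over repeated samples."""
    cov = [[1.0, r12], [r12, 1.0]]
    estimates = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 0.25 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(size=n)
        Xd = np.column_stack([np.ones(n), X])            # add an intercept
        beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
        estimates.append(beta[1])
    return float(np.std(estimates))

# The dispersion of the beta_1 estimate grows as x1 and x2 become more correlated.
print(sd_beta1(0.0))   # nearly uncorrelated predictors
print(sd_beta1(0.9))   # r(x1, x2) = 0.9: clearly larger dispersion
```

With r = 0.9 the standard deviation of the estimate comes out roughly sqrt(1 / (1 - 0.81)) ≈ 2.3 times larger, which is exactly the inflation the lecture quantifies later with the variance inflation factor.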
00:02:49.530 --> 00:02:53.040
So is that the problem?
Well that depends.
00:02:53.040 --> 00:02:57.750
Let's take an example, what will
happen when we have two highly
00:02:57.750 --> 00:03:01.830
correlated variables and what it means
for the regression analysis results.
00:03:01.830 --> 00:03:05.220
So we should expect when two
variables are highly correlated
00:03:05.220 --> 00:03:08.010
the regression results to be very imprecise.
00:03:08.730 --> 00:03:13.530
That is, if we repeat the study over and
over many times, the dispersion of the
00:03:13.530 --> 00:03:16.200
estimates over multiple repeated samples is large.
00:03:18.120 --> 00:03:27.390
Here we have the correlation between x1 and x2 at
0.9, which is modeled based on Hekman's paper.
00:03:27.390 --> 00:03:35.370
Let's assume that the correlation between x1
and y varies between 0.43 and 0.52, so this
00:03:35.370 --> 00:03:41.250
is the variation, and this kind of dispersion
could easily be a result of a small sample.
00:03:41.250 --> 00:03:46.500
Let's assume that this
0.475 is the population value
00:03:46.500 --> 00:03:49.680
with a sample size of, for example, 100.
00:03:49.680 --> 00:03:53.610
It is very easy to get a sample correlation of 0.43.
00:03:53.610 --> 00:03:58.170
Then we have the correlation between x2 and y modeled
00:03:58.170 --> 00:04:04.110
the same way and we have five
combinations of correlations.
00:04:04.110 --> 00:04:06.150
These correlations vary a little.
00:04:06.150 --> 00:04:10.920
The correlation between x1 and y varies a little,
and the correlation between x2 and y varies a little.
00:04:10.920 --> 00:04:16.620
Because x1 and x2 are correlated when
we calculate the regression model using
00:04:16.620 --> 00:04:20.460
these correlations the regression
estimates actually vary widely.
00:04:20.460 --> 00:04:27.240
So in this model the regression
coefficient is -0.2, and here it's +0.7.
00:04:28.200 --> 00:04:30.060
Even the sign flips.
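These numbers can be reproduced directly from the correlations: for standardized variables, the regression coefficients solve R_xx * beta = r_xy. A quick editor's sketch in Python, using the correlations quoted above:

```python
import numpy as np

def std_betas(r12, r1y, r2y):
    """Standardized OLS coefficients computed from a correlation matrix."""
    Rxx = np.array([[1.0, r12], [r12, 1.0]])   # correlation among predictors
    rxy = np.array([r1y, r2y])                 # correlations with y
    return np.linalg.solve(Rxx, rxy)

# Population values: both effects are 0.25.
print(std_betas(0.9, 0.475, 0.475))   # approximately [0.25, 0.25]

# Small sampling shifts in the correlations with y swing the estimates wildly.
print(std_betas(0.9, 0.43, 0.52))     # approximately [-0.2, 0.7]
print(std_betas(0.9, 0.52, 0.43))     # approximately [0.7, -0.2]
```

Moving each correlation with y by less than 0.05 flips the sign of a coefficient, precisely because the 0.9 correlation between x1 and x2 makes the solution ill-conditioned.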
00:04:30.060 --> 00:04:35.490
Now the multicollinearity problem
relates to the fact that because
00:04:35.490 --> 00:04:43.260
x1 and x2 are so highly correlated that it is
very difficult to get the unique effect of x1.
00:04:43.260 --> 00:04:50.100
Because the changes in x1 are
always accompanied by changes in x2.
00:04:50.100 --> 00:04:56.100
So we don't know which one it is. Consider
company size: size in revenue
00:04:56.100 --> 00:05:01.080
and size in personnel are highly correlated,
maybe not at 0.9, but still highly correlated.
00:05:01.080 --> 00:05:06.150
So it's difficult to say whether, for example,
investment decisions depend more
00:05:06.150 --> 00:05:11.370
on the number of people or the revenues of
the company just based on statistical means.
00:05:11.370 --> 00:05:14.010
So what's the problem?
00:05:14.010 --> 00:05:23.220
The problem here is that if we want to say
that this effect of beta 1 is 0.25 and not 0
00:05:23.220 --> 00:05:27.000
we have to be able to differentiate
between these two correlations.
00:05:28.170 --> 00:05:33.540
And how large a sample, how
large a sample size would we require
00:05:33.540 --> 00:05:41.310
to say for sure that the correlation is
0.475 instead of 0.45 or 0.5?
00:05:41.310 --> 00:05:44.280
We have to understand the sampling
variation of a correlation.
00:05:44.280 --> 00:05:51.300
So the standard error, the standard
deviation, of a correlation of 0.475
00:05:51.300 --> 00:05:55.410
with a sample size of 100 is about 0.05.
00:05:55.410 --> 00:06:01.950
So if our sample size is 100 then we
can easily get something like 0.40
00:06:01.950 --> 00:06:07.020
or 0.5 too which are less than one
standard deviation from this mean.
00:06:07.740 --> 00:06:12.000
We can easily get these kinds
of correlations with a sample of 100.
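As a rough check of how much a sample correlation wobbles at n = 100, here is a small editor's simulation in Python; it assumes bivariate normal data, which the lecture does not specify:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.475, 100
cov = [[1.0, rho], [rho, 1.0]]

# Draw many samples of size 100 and record the sample correlation each time.
rs = np.array([
    np.corrcoef(rng.multivariate_normal([0.0, 0.0], cov, size=n), rowvar=False)[0, 1]
    for _ in range(5000)
])

print(rs.std())                                # dispersion of the sample correlation
print(((rs < 0.43) | (rs > 0.52)).mean())      # sample r frequently lands outside 0.43-0.52
```

The simulated dispersion lands in the 0.05 to 0.08 range, and the sample correlation falls outside the 0.43 to 0.52 window a substantial share of the time, supporting the point that n = 100 cannot separate these correlations.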
00:06:12.000 --> 00:06:17.070
So when our sample size is 100
and x1 and x2 are correlated at 0.9.
00:06:17.070 --> 00:06:23.310
We really cannot say which of
these sets of coefficients is the correct one.
00:06:23.310 --> 00:06:27.900
Because our sample size doesn't
allow us enough precision to say
00:06:27.900 --> 00:06:31.950
which of these correlations are
the true population correlations.
00:06:31.950 --> 00:06:35.880
That determine the population regression
coefficients that we are interested in.
00:06:35.880 --> 00:06:44.250
So the fact that these two variables
are highly correlated kind of amplifies the
00:06:44.250 --> 00:06:48.360
effect of sampling variation of this correlation.
00:06:48.360 --> 00:06:52.980
The sampling variation of the correlations
here is small, but because x1 and x2 are
00:06:52.980 --> 00:06:58.050
highly correlated that amplifies the
effect on these regression coefficients.
00:06:58.050 --> 00:07:09.060
To be sure that model 3 is actually correct, so that
00:07:10.080 --> 00:07:15.690
two standard deviations of the correlation wouldn't
be enough to get us from one model to another,
00:07:15.690 --> 00:07:18.030
We would need a sample size of 3000.
00:07:18.030 --> 00:07:23.310
So when variables are highly correlated,
that is referred to as multicollinearity.
00:07:24.000 --> 00:07:26.670
It refers to correlations
among the independent variables.
00:07:26.670 --> 00:07:29.100
It has nothing to do with the dependent variable.
00:07:29.100 --> 00:07:35.160
And it increases the sample size requirements
for us to estimate the effects.
00:07:35.160 --> 00:07:41.190
And this inflation of the
00:07:41.190 --> 00:07:46.170
variance estimates is quantified
by the variance inflation factor.
00:07:46.170 --> 00:07:54.870
What the variance inflation
factor quantifies is
00:07:54.870 --> 00:08:01.950
how much larger the variance of an estimate
is compared to a hypothetical scenario.
00:08:01.950 --> 00:08:06.570
Where the variable would be
uncorrelated with every other variable.
00:08:06.570 --> 00:08:12.660
So the variance inflation factor is
basically defined as 1 divided
00:08:12.660 --> 00:08:17.730
by 1 - R square of the focal variable
on all other independent variables.
00:08:18.810 --> 00:08:24.840
It's this 1 - R square part of the equation
here, so when that goes to 0,
00:08:24.840 --> 00:08:29.520
then variance inflation factor goes to infinity.
00:08:29.520 --> 00:08:34.680
When that is exactly 1 then variance
inflation factor is 1 which means
00:08:34.680 --> 00:08:38.010
that the multicollinearity is not present at all.
00:08:39.570 --> 00:08:44.430
There's a rule of thumb that many
people use, that the variance
00:08:44.430 --> 00:08:48.900
inflation factor should not exceed
10, and if it does, we have a problem.
00:08:48.900 --> 00:08:50.790
If it doesn't we don't have a problem.
00:08:50.790 --> 00:08:56.760
So in the previous slide I showed you
that a 0.9 correlation between two
00:08:56.760 --> 00:09:02.460
variables makes it very hard to say
which one of them has the actual effect.
00:09:02.460 --> 00:09:04.440
Because they covary so strongly together.
00:09:04.440 --> 00:09:10.080
So what is the variance inflation factor
when the correlation of x1 and x2 is 0.9?
00:09:10.080 --> 00:09:18.780
We can calculate the variance inflation
factor by taking the square of this correlation.
00:09:18.780 --> 00:09:24.390
So R square is the square of the correlation.
That's 0.9 to the second power,
00:09:24.390 --> 00:09:33.270
and then we just plug in the numbers, do some math,
and we get a variance inflation factor of 5.26.
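That arithmetic is easy to verify. A tiny editor's sketch in Python for the two-predictor case, where the R squared of one predictor on the other is simply the squared correlation:

```python
def vif(r):
    """Variance inflation factor for two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)

print(round(vif(0.9), 2))    # 5.26: high collinearity, yet well under the cutoff of 10
print(vif(0.0))              # 1.0: no multicollinearity at all
print(round(vif(0.99), 2))   # the factor explodes as r approaches 1
```

Note that the severe 0.9 correlation from the example produces a VIF of only 5.26, which is the point made next: the rule-of-thumb cutoff of 10 would not flag it.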
00:09:33.270 --> 00:09:39.330
So in the previous example we would have
needed 3000 observations to say for sure that
00:09:39.330 --> 00:09:43.320
model 3 was the correct model.
And not model 2 or model 4.
00:09:43.320 --> 00:09:49.350
But the variance inflation factor wouldn't have
detected that we had a multicollinearity issue.
00:09:49.350 --> 00:09:53.340
So what does that say about this rule?
00:09:53.340 --> 00:09:56.370
It is not a very useful rule.
00:09:56.370 --> 00:10:00.930
Ketokivi and Guide make a
good point about this rule,
00:10:00.930 --> 00:10:04.740
and about rules in general, in a Journal
of Operations Management editorial.
00:10:04.740 --> 00:10:06.720
So this is from 2015.
00:10:06.720 --> 00:10:12.000
When Ketokivi and Guide took over the Journal of
Operations Management as editors-in-chief,
00:10:12.000 --> 00:10:18.930
they first published an editorial on what
the methodological standards for this journal are.
00:10:18.930 --> 00:10:20.820
And they identified some problems.
00:10:20.820 --> 00:10:24.720
And they also identified places for improvements.
00:10:24.720 --> 00:10:28.530
So what you should not do and
what you should do. And they
00:10:28.530 --> 00:10:32.520
emphasized that you always have to
contextualize all your statistics.
00:10:32.520 --> 00:10:39.030
Like when you say the regression
coefficient is 0.2 whether it's a
00:10:39.030 --> 00:10:42.000
large effect or not depends on
the scales of both variables.
00:10:42.000 --> 00:10:43.980
And it also depends on the context.
00:10:43.980 --> 00:10:46.560
If you get a thousand more per year for
00:10:46.560 --> 00:10:51.390
each additional year of education
that's a big effect for somebody.
00:10:51.390 --> 00:10:56.340
And it's a small effect for another person
depending on where the person lives,
00:10:56.340 --> 00:10:58.290
and how much the person makes.
00:10:58.290 --> 00:11:01.860
So for all of these statistics, the interpretation
00:11:01.860 --> 00:11:06.630
requires context and they take aim at
the variance inflation factor as well.
00:11:06.630 --> 00:11:11.100
So the variance inflation factor quantifies
how much larger the variance would be
00:11:11.100 --> 00:11:16.710
compared to if there was no multicollinearity
whatsoever between the independent variables.
00:11:16.710 --> 00:11:24.990
And they say that if the standard errors from
your analysis are small, then who cares that they
00:11:24.990 --> 00:11:32.160
could be smaller if your independent
variables were completely uncorrelated.
00:11:32.160 --> 00:11:34.740
Which is an unrealistic scenario anyway.
00:11:34.740 --> 00:11:40.950
So if the standard errors indicate that
the estimates are precise, then who
00:11:40.950 --> 00:11:43.080
cares? They are precise, and that's what we care about.
00:11:43.080 --> 00:11:47.340
So the variance inflation factor doesn't
really tell us anything useful.
00:11:47.340 --> 00:11:51.810
On the other hand they also say that
in some scenarios the rule of thumb
00:11:51.810 --> 00:11:56.790
that variance inflation factor
must not exceed 10 is not enough.
00:11:56.790 --> 00:12:03.150
So in the previous example we saw that
there was a 0.9 correlation, corresponding
00:12:03.150 --> 00:12:11.430
to a variance inflation factor of 5.26,
which still made it a lot more
00:12:11.430 --> 00:12:15.780
difficult for us to identify which
one of those models was correct.
00:12:15.780 --> 00:12:19.830
So we had a collinearity issue, and it wasn't
detected by the variance inflation factor.
00:12:19.830 --> 00:12:26.940
So regarding the variance inflation factor,
as Ketokivi and Guide say, stating
00:12:26.940 --> 00:12:32.340
that it must not exceed a cut-off without
considering the context is nonsense.
00:12:32.340 --> 00:12:38.490
So that's what they say and I
agree with that statement fully.
00:12:39.330 --> 00:12:44.310
You have to always contextualize what does
a statistic mean in your particular study.
00:12:44.310 --> 00:12:51.030
Wooldridge also takes some shots at the variance
inflation factor and multicollinearity.
00:12:51.030 --> 00:12:57.300
So this is from the fourth edition of his
introductory econometrics textbook, and he didn't address
00:12:57.300 --> 00:13:02.160
multicollinearity in the first three
editions of his book because he thinks
00:13:02.160 --> 00:13:05.730
that it is not a useful concept
or it's not important enough.
00:13:05.730 --> 00:13:10.320
Regression analysis does not make any
assumptions about multicollinearity,
00:13:10.320 --> 00:13:15.630
it makes the assumption that each independent
variable should contribute unique information.
00:13:15.630 --> 00:13:22.840
So the variables cannot be perfectly correlated,
but it doesn't make any assumptions beyond that.
00:13:22.840 --> 00:13:26.920
He decided that he's going to take up this
00:13:26.920 --> 00:13:30.490
issue because there's so much bad
advice about multicollinearity.
00:13:31.930 --> 00:13:39.580
He says that these explanations of
multicollinearity are typically wrongheaded.
00:13:39.580 --> 00:13:42.280
People explain that it is a problem.
00:13:42.280 --> 00:13:45.310
And then if you have variance inflation factor
00:13:45.310 --> 00:13:50.680
more than 10 you have to drop variables
without really explaining the problem.
00:13:50.680 --> 00:13:55.510
Or what the consequences of
dropping variables from your model are.
00:13:55.510 --> 00:14:02.110
So let's now take a look at what
it means to solve a multicollinearity problem.
00:14:02.110 --> 00:14:08.260
So, to understand the multicollinearity
problem: multicollinearity is a problem
00:14:08.260 --> 00:14:14.620
in the same sense that a fever is a
disease. It is not really a problem per se,
00:14:14.620 --> 00:14:19.570
it is a symptom and you don't treat
the symptom you treat the disease.
00:14:19.570 --> 00:14:25.300
So if you have a child who has
fever, typically cooling down the
00:14:25.300 --> 00:14:29.050
child by putting them outside in the cold
is not the right treatment.
00:14:29.050 --> 00:14:31.330
You have to look at the
cause of the multicollinearity, the
00:14:31.330 --> 00:14:35.980
cause of the fever and fix the cause
instead of trying to fix the symptom.
00:14:35.980 --> 00:14:40.240
The typical solution for
multicollinearity problems,
00:14:40.240 --> 00:14:43.210
how do we make x1 and x2 less correlated?
00:14:43.210 --> 00:14:45.220
Well we just drop one from the model.
00:14:45.220 --> 00:14:52.930
So let's say we drop
x2 from the model.
00:14:52.930 --> 00:14:58.300
In the previous example, the correct model
was that both effects were 0.25.
00:14:58.300 --> 00:15:10.990
And now if we drop x2, then the estimate of x1
will reflect the influence of both x1 and x2.
00:15:10.990 --> 00:15:15.160
So what will happen is that we will overestimate the
00:15:15.160 --> 00:15:20.050
regression coefficient beta 1 by
90%, and the standard errors are smaller.
00:15:20.050 --> 00:15:28.570
So we will have a false sense of accuracy
related to this severely biased estimate.
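The 90% figure follows from the omitted-variable-bias formula: with standardized variables, the estimate of beta 1 converges to beta_1 + beta_2 * r12 when x2 is left out. A quick editor's check in Python using the example's values:

```python
beta1, beta2, r12 = 0.25, 0.25, 0.9

# When x2 is omitted, x1 picks up x2's effect
# through the 0.9 correlation between x1 and x2.
biased = beta1 + beta2 * r12
print(biased)                        # 0.475 instead of the true 0.25
print((biased - beta1) / beta1)      # relative overestimate: about 0.9, i.e. 90%
```

So "solving" the collinearity by dropping a variable trades an imprecise but unbiased estimate for a precise-looking but badly biased one.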
00:15:28.570 --> 00:15:34.720
And also if you have control variables
that are collinear with one another,
00:15:34.720 --> 00:15:38.290
that is irrelevant because typically we just want
00:15:38.290 --> 00:15:42.760
to know how much of the variation of the
dependent variable is explained jointly,
00:15:42.760 --> 00:15:45.580
by those controls; we're
not really interested in
00:15:45.580 --> 00:15:49.030
which one of the controls actually
explains the dependent variable.
00:15:50.680 --> 00:15:54.550
Collinearity between
00:15:54.550 --> 00:15:57.190
the interesting variables and
the controls is important.
00:15:57.190 --> 00:16:01.090
But if the collinearity is just among the
controls, then it doesn't matter.
00:16:01.090 --> 00:16:09.220
Okay so treating collinearity as a problem is
the same thing as treating fever as a disease.
00:16:09.220 --> 00:16:11.410
So it's not a smart thing to do.
00:16:11.410 --> 00:16:16.780
We have to understand what are the reasons
why two variables are so highly correlated
00:16:16.780 --> 00:16:21.520
that we can't really say which one is
the cause of the dependent variable.
00:16:21.520 --> 00:16:25.240
So there are a couple of
reasons why that could happen.
00:16:25.990 --> 00:16:28.930
Multicollinearity could be happening because you
00:16:28.930 --> 00:16:31.510
have mindlessly added a lot
of variables into the model.
00:16:31.510 --> 00:16:35.710
And you shouldn't be adding
variables to the model mindlessly.
00:16:35.710 --> 00:16:40.030
All variables that go into your
model must be based on theory.
00:16:40.030 --> 00:16:44.860
So just throwing a hundred variables into a
model typically doesn't make sense.
00:16:44.860 --> 00:16:50.320
Your models are built to test theory,
and so they must be driven by theory.
00:16:50.320 --> 00:16:56.530
So what you think has a causal
effect on the Y variable must
00:16:56.530 --> 00:17:00.580
go into the model, and you also must
be able to explain why: what's the
00:17:00.580 --> 00:17:05.620
mechanism by which the independent variable
influences the dependent variable causally.
00:17:05.620 --> 00:17:08.050
So that is one case.
You have just been
00:17:08.050 --> 00:17:13.840
mindlessly data mining, and that's a problem.
So multicollinearity is not the problem here,
00:17:13.840 --> 00:17:17.650
the problem is that you're making
stupid modeling decisions.
00:17:17.650 --> 00:17:22.990
The second problem is that you have distinct
constructs but their measures are highly
00:17:22.990 --> 00:17:31.120
correlated, and here the primary problem is not
multicollinearity but discriminant validity.
00:17:31.120 --> 00:17:38.020
So if two measures of things that are
supposed to be distinct are highly
00:17:38.020 --> 00:17:41.980
correlated, it's a problem of measurement validity.
00:17:41.980 --> 00:17:44.170
I'll address that in a later video.
00:17:44.170 --> 00:17:47.740
Then there is the case where you have two
measures of the same construct in the model.
00:17:47.740 --> 00:17:53.080
For example, if you are studying the
effect of company size, then you have
00:17:53.080 --> 00:17:56.890
revenue and personnel both as
measures of size in the model.
00:17:56.890 --> 00:18:01.390
That's not a good idea to have two
measures of the same thing in the model.
00:18:01.390 --> 00:18:06.460
Let's take an extreme example, let's
assume that we want to study the effect
00:18:06.460 --> 00:18:10.570
of person's height on person's weight
and we have two measures of height.
00:18:10.570 --> 00:18:12.250
We have centimeters and inches.
00:18:12.250 --> 00:18:18.670
It doesn't make any sense to try to
get the effect of inches independent
00:18:18.670 --> 00:18:21.400
of the effect of centimeters; in fact,
that can't even be estimated.
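This is easy to see in matrix terms: with height in both centimeters and inches, the design matrix loses a rank, so the normal equations have no unique solution. A small editor's sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(2)
cm = rng.normal(170.0, 10.0, size=50)   # heights in centimeters
inches = cm / 2.54                      # the same heights, perfectly collinear

# Design matrix with intercept, cm, and inches: rank 2, not 3,
# so separate effects for cm and inches cannot be estimated.
X = np.column_stack([np.ones(50), cm, inches])
print(np.linalg.matrix_rank(X))
```

This is the extreme end of the same continuum: at correlation 1.0 the unique effects are not merely imprecise, they are undefined.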
00:18:21.400 --> 00:18:29.020
So if you have multiple
measures of the same thing, then typically
00:18:29.770 --> 00:18:36.820
you should first combine those multiple
measures into a single composite measure.
00:18:36.820 --> 00:18:37.870
I'll cover that later on.
00:18:37.870 --> 00:18:44.020
Then the final case is that you are really
interested in two closely related constructs.
00:18:44.020 --> 00:18:45.370
And their distinct effects.
00:18:46.510 --> 00:18:48.970
For example you want to know whether a person's
00:18:48.970 --> 00:18:55.360
age or a person's tenure influences
the customer satisfaction scores
00:18:55.360 --> 00:18:59.620
That the patients give to the
doctors, like in Hekman's study.
00:18:59.620 --> 00:19:03.640
Then you really cannot drop either one of those.
00:19:03.640 --> 00:19:09.400
You can't say that, because tenure
and age are highly correlated,
00:19:09.400 --> 00:19:15.430
we are just going to omit tenure and
assume that all the correlation between
00:19:15.430 --> 00:19:21.970
age and customer satisfaction is due to the
age only and tenure doesn't have an effect.
00:19:21.970 --> 00:19:27.820
So that is not the right choice.
Instead you have to just increase the sample size.
00:19:27.820 --> 00:19:34.480
So that you can answer your complex
research question in a precise manner.