WEBVTT
00:00:00.090 --> 00:00:05.190
Multicollinearity is another commonly
misunderstood feature of regression analysis.
00:00:05.190 --> 00:00:09.900
Multicollinearity refers to a scenario where
the independent variables are highly correlated.
00:00:09.900 --> 00:00:16.350
It is quite common to see studies do
diagnostics to detect multicollinearity.
00:00:16.350 --> 00:00:19.410
And then drop some variables from the model based
00:00:19.410 --> 00:00:25.200
on some statistics that indicate that
multicollinearity could be a problem.
00:00:25.200 --> 00:00:30.210
There are quite a lot of difficulties
or problems with that approach.
00:00:30.210 --> 00:00:36.780
Let's take a look at Hekman's paper.
So they identified that the customer race
00:00:36.780 --> 00:00:42.600
and customer gender were highly correlated
with physician race and physician gender.
00:00:42.600 --> 00:00:46.920
And therefore they decided to drop
customer gender and customer race
00:00:46.920 --> 00:00:51.360
from the data because that caused
the multicollinearity situation.
00:00:51.360 --> 00:00:54.750
This was because these variables were
correlated at more than 0.9.
00:00:54.750 --> 00:01:00.930
So what is this issue about, and why
would one want to drop variables?
00:01:00.930 --> 00:01:08.190
Multicollinearity relates to the
sampling variance of the OLS estimate.
00:01:08.190 --> 00:01:11.970
Or more generally, any estimator
that estimates a linear model.
00:01:11.970 --> 00:01:17.610
So to understand multicollinearity let's take
a look at the variance of the OLS estimates.
00:01:18.150 --> 00:01:22.440
The variance of the OLS estimates
is given by the equation Var(beta_j) = sigma^2 / (SST_j (1 - R_j^2)) shown
00:01:22.440 --> 00:01:26.730
here and this equation is also used
for estimating the standard errors.
00:01:28.350 --> 00:01:33.990
This equation tells us that the
variance of estimates depends on
00:01:33.990 --> 00:01:40.470
how well the other independent variables
explain the focal independent variable.
00:01:40.470 --> 00:01:43.920
Whose coefficient's variance we are interested in.
00:01:43.920 --> 00:01:54.630
So when this R square here goes up, then the
variance of the regression coefficient increases.
00:01:54.630 --> 00:01:59.820
The reason is that when this R
square approaches 1, then 1 - R square
00:01:59.820 --> 00:02:05.610
approaches 0, and when you multiply
something by something that approaches 0,
00:02:05.610 --> 00:02:10.500
the result of the multiplication
will approach 0 as well, and when you
00:02:10.500 --> 00:02:16.800
divide something by something that approaches
zero, then you will get a large number.
00:02:16.800 --> 00:02:23.130
So the variance increases when the focal
00:02:23.130 --> 00:02:26.670
variable is increasingly redundant in the model.
00:02:26.670 --> 00:02:29.670
It provides the same information
as the other variables.
00:02:29.670 --> 00:02:32.520
Then the standard error will increase.
00:02:33.510 --> 00:02:36.510
When our independent variables are more
00:02:36.510 --> 00:02:43.470
correlated, then the estimates will
be less efficient and less precise.
00:02:43.470 --> 00:02:46.470
And also the standard error will be larger because
00:02:46.470 --> 00:02:49.530
the standard error estimates the
precision of the estimates.
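This effect can be checked with a small simulation. The sketch below is an editor's illustration in Python, not part of the lecture; the coefficient values, noise level, and sample size are assumptions chosen to mirror the example discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_beta1(r12, n=100, reps=500):
    """Empirical dispersion of the OLS estimate of beta_1 over repeated samples."""
    cov = [[1.0, r12], [r12, 1.0]]
    estimates = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 0.25 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(size=n)
        Xd = np.column_stack([np.ones(n), X])            # add an intercept
        beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
        estimates.append(beta[1])
    return float(np.std(estimates))

# The dispersion of the beta_1 estimate grows as x1 and x2 become more correlated.
print(sd_beta1(0.0))   # nearly uncorrelated predictors
print(sd_beta1(0.9))   # r(x1, x2) = 0.9: clearly larger dispersion
```

With r = 0.9 the standard deviation of the estimate comes out roughly sqrt(1 / (1 - 0.81)) ≈ 2.3 times larger, which is exactly the inflation the lecture quantifies later with the variance inflation factor.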
00:02:49.530 --> 00:02:53.040
So is that the problem?
Well that depends.
00:02:53.040 --> 00:02:57.750
Let's take an example, what will
happen when we have two highly
00:02:57.750 --> 00:03:01.830
correlated variables and what it means
for the regression analysis results.
00:03:01.830 --> 00:03:05.220
So we should expect when two
variables are highly correlated
00:03:05.220 --> 00:03:08.010
the regression results to be very imprecise.
00:03:08.730 --> 00:03:13.530
That is, if we repeat the study over and
over many times, the dispersion of the
00:03:13.530 --> 00:03:16.200
estimates over multiple repeated samples is large.
00:03:18.120 --> 00:03:27.390
Here we have the correlation between x1 and x2 at
0.9, which is modeled based on Hekman's paper.
00:03:27.390 --> 00:03:35.370
Let's assume that the correlation between x1
and y varies between 0.43 and 0.52, so this
00:03:35.370 --> 00:03:41.250
is the variation, and this kind of dispersion
could easily be a result of a small sample.
00:03:41.250 --> 00:03:46.500
Let's assume that this
0.475 is the population value
00:03:46.500 --> 00:03:49.680
with a sample size of, for example, 100.
00:03:49.680 --> 00:03:53.610
It is very easy to get a sample correlation of 0.43.
00:03:53.610 --> 00:03:58.170
Then we have the correlation between x2 and y modeled
00:03:58.170 --> 00:04:04.110
the same way and we have five
combinations of correlations.
00:04:04.110 --> 00:04:06.150
These correlations vary a little.
00:04:06.150 --> 00:04:10.920
The correlation between x1 and y varies a little,
and the correlation between x2 and y varies a little.
00:04:10.920 --> 00:04:16.620
Because x1 and x2 are correlated when
we calculate the regression model using
00:04:16.620 --> 00:04:20.460
these correlations the regression
estimates actually vary widely.
00:04:20.460 --> 00:04:27.240
So in this model the regression
coefficient is -0.2, and here it's +0.7.
00:04:28.200 --> 00:04:30.060
Even the sign flips.
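These numbers can be reproduced directly from the correlations: for standardized variables, the regression coefficients solve R_xx * beta = r_xy. A quick editor's sketch in Python, using the correlations quoted above:

```python
import numpy as np

def std_betas(r12, r1y, r2y):
    """Standardized OLS coefficients computed from a correlation matrix."""
    Rxx = np.array([[1.0, r12], [r12, 1.0]])   # correlation among predictors
    rxy = np.array([r1y, r2y])                 # correlations with y
    return np.linalg.solve(Rxx, rxy)

# Population values: both effects are 0.25.
print(std_betas(0.9, 0.475, 0.475))   # approximately [0.25, 0.25]

# Small sampling shifts in the correlations with y swing the estimates wildly.
print(std_betas(0.9, 0.43, 0.52))     # approximately [-0.2, 0.7]
print(std_betas(0.9, 0.52, 0.43))     # approximately [0.7, -0.2]
```

Moving each correlation with y by less than 0.05 flips the sign of a coefficient, precisely because the 0.9 correlation between x1 and x2 makes the solution ill-conditioned.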
00:04:30.060 --> 00:04:35.490
Now the multicollinearity problem
relates to the fact that because
00:04:35.490 --> 00:04:43.260
x1 and x2 are so highly correlated that it is
very difficult to get the unique effect of x1.
00:04:43.260 --> 00:04:50.100
Because the changes in x1 are
always accompanied by changes in x2.
00:04:50.100 --> 00:04:56.100
So we don't know which one it is. Consider
company size: size in revenue
00:04:56.100 --> 00:05:01.080
and size in personnel are highly correlated,
maybe not at 0.9, but still highly correlated.
00:05:01.080 --> 00:05:06.150
So it's difficult to say whether, for example,
investment decisions depend more
00:05:06.150 --> 00:05:11.370
on the number of people or the revenues of
the company just based on statistical means.
00:05:11.370 --> 00:05:14.010
So what's the problem?
00:05:14.010 --> 00:05:23.220
The problem here is that if we want to say
that this effect of beta 1 is 0.25 and not 0
00:05:23.220 --> 00:05:27.000
we have to be able to differentiate
between these two correlations.
00:05:28.170 --> 00:05:33.540
And how large a sample, how
large a sample size would we require
00:05:33.540 --> 00:05:41.310
to say for sure that the correlation is
0.475 instead of 0.45 or 0.5?
00:05:41.310 --> 00:05:44.280
We have to understand the sampling
variation of a correlation.
00:05:44.280 --> 00:05:51.300
So the standard error, the standard
deviation, of a correlation of 0.475
00:05:51.300 --> 00:05:55.410
with a sample size of 100 is about 0.05.
00:05:55.410 --> 00:06:01.950
So if our sample size is 100 then we
can easily get something like 0.40
00:06:01.950 --> 00:06:07.020
or 0.5 too which are less than one
standard deviation from this mean.
00:06:07.740 --> 00:06:12.000
We can easily get these kinds
of correlations with a sample of 100.
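As a rough check of how much a sample correlation wobbles at n = 100, here is a small editor's simulation in Python; it assumes bivariate normal data, which the lecture does not specify:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.475, 100
cov = [[1.0, rho], [rho, 1.0]]

# Draw many samples of size 100 and record the sample correlation each time.
rs = np.array([
    np.corrcoef(rng.multivariate_normal([0.0, 0.0], cov, size=n), rowvar=False)[0, 1]
    for _ in range(5000)
])

print(rs.std())                                # dispersion of the sample correlation
print(((rs < 0.43) | (rs > 0.52)).mean())      # sample r frequently lands outside 0.43-0.52
```

The simulated dispersion lands in the 0.05 to 0.08 range, and the sample correlation falls outside the 0.43 to 0.52 window a substantial share of the time, supporting the point that n = 100 cannot separate these correlations.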
00:06:12.000 --> 00:06:17.070
So when our sample size is 100
and x1 and x2 are correlated at 0.9.
00:06:17.070 --> 00:06:23.310
We really cannot say which of
these sets of coefficients is the correct one.
00:06:23.310 --> 00:06:27.900
Because our sample size doesn't
allow us enough precision to say
00:06:27.900 --> 00:06:31.950
which of these correlations are
the true population correlations.
00:06:31.950 --> 00:06:35.880
That determine the population regression
coefficients that we are interested in.
00:06:35.880 --> 00:06:44.250
So the fact that these two variables
are highly correlated kind of amplifies the
00:06:44.250 --> 00:06:48.360
effect of sampling variation of this correlation.
00:06:48.360 --> 00:06:52.980
The sampling variation of the correlations
here is small, but because x1 and x2 are
00:06:52.980 --> 00:06:58.050
highly correlated that amplifies the
effect on these regression coefficients.
00:06:58.050 --> 00:07:09.060
To be sure that model 3 is actually correct, so that
00:07:10.080 --> 00:07:15.690
two standard deviations of the correlation wouldn't
be enough to get us from one model to another,
00:07:15.690 --> 00:07:18.030
We would need a sample size of 3000.
00:07:18.030 --> 00:07:23.310
So when variables are highly correlated,
that is referred to as multicollinearity.
00:07:24.000 --> 00:07:26.670
It refers to correlations
among the independent variables.
00:07:26.670 --> 00:07:29.100
It has nothing to do with the dependent variable.
00:07:29.100 --> 00:07:35.160
And it increases the sample size requirements
for us to estimate the effects.
00:07:35.160 --> 00:07:41.190
And this inflation of the
00:07:41.190 --> 00:07:46.170
variance estimates is quantified
by the variance inflation factor.
00:07:46.170 --> 00:07:54.870
What the variance inflation
factor quantifies is
00:07:54.870 --> 00:08:01.950
how much larger the variance of an estimate
is compared to a hypothetical scenario.
00:08:01.950 --> 00:08:06.570
Where the variable would be
uncorrelated with every other variable.
00:08:06.570 --> 00:08:12.660
So the variance inflation factor is
basically defined as 1 divided
00:08:12.660 --> 00:08:17.730
by 1 - R square of the focal variable
on all other independent variables.
00:08:18.810 --> 00:08:24.840
It's this 1 - R square part of the equation
here, so when that goes to 0,
00:08:24.840 --> 00:08:29.520
then variance inflation factor goes to infinity.
00:08:29.520 --> 00:08:34.680
When that is exactly 1 then variance
inflation factor is 1 which means
00:08:34.680 --> 00:08:38.010
that the multicollinearity is not present at all.
00:08:39.570 --> 00:08:44.430
There's a rule of thumb that many
people use, that the variance
00:08:44.430 --> 00:08:48.900
inflation factor should not exceed
10, and if it does, we have a problem.
00:08:48.900 --> 00:08:50.790
If it doesn't we don't have a problem.
00:08:50.790 --> 00:08:56.760
So in the previous slide I showed you
that a 0.9 correlation between two
00:08:56.760 --> 00:09:02.460
variables makes it very hard to say
which one of them has the actual effect.
00:09:02.460 --> 00:09:04.440
Because they covary so strongly together.
00:09:04.440 --> 00:09:10.080
So what is the variance inflation factor
when the correlation of x1 and x2 is 0.9?
00:09:10.080 --> 00:09:18.780
We can calculate the variance inflation
factor by taking the square of this correlation.
00:09:18.780 --> 00:09:24.390
So R square is the square of the correlation.
That's 0.9 to the second power,
00:09:24.390 --> 00:09:33.270
and then we just plug in the numbers, do some math,
and we get a variance inflation factor of 5.26.
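That arithmetic is easy to verify. A tiny editor's sketch in Python for the two-predictor case, where the R squared of one predictor on the other is simply the squared correlation:

```python
def vif(r):
    """Variance inflation factor for two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)

print(round(vif(0.9), 2))    # 5.26: high collinearity, yet well under the cutoff of 10
print(vif(0.0))              # 1.0: no multicollinearity at all
print(round(vif(0.99), 2))   # the factor explodes as r approaches 1
```

Note that the severe 0.9 correlation from the example produces a VIF of only 5.26, which is the point made next: the rule-of-thumb cutoff of 10 would not flag it.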
00:09:33.270 --> 00:09:39.330
So in the previous example we would have
needed 3000 observations to say for sure that
00:09:39.330 --> 00:09:43.320
model 3 was the correct model.
And not model 2 or model 4.
00:09:43.320 --> 00:09:49.350
But the variance inflation factor wouldn't have
detected that we had a multicollinearity issue.
00:09:49.350 --> 00:09:53.340
So what does that say about this rule?
00:09:53.340 --> 00:09:56.370
It is not a very useful rule.
00:09:56.370 --> 00:10:00.930
Ketokivi and Guide make a
good point about this rule,
00:10:00.930 --> 00:10:04.740
and about rules in general, in a Journal
of Operations Management editorial.
00:10:04.740 --> 00:10:06.720
So this is from 2015.
00:10:06.720 --> 00:10:12.000
When Ketokivi and Guide took over the Journal of
Operations Management as editors-in-chief,
00:10:12.000 --> 00:10:18.930
they first published an editorial on what
the methodological standards for this journal are.
00:10:18.930 --> 00:10:20.820
And they identified some problems.
00:10:20.820 --> 00:10:24.720
And they also identified places for improvements.
00:10:24.720 --> 00:10:28.530
So what you should not do and
what you should do. And they
00:10:28.530 --> 00:10:32.520
emphasized that you always have to
contextualize all your statistics.
00:10:32.520 --> 00:10:39.030
Like when you say the regression
coefficient is 0.2 whether it's a
00:10:39.030 --> 00:10:42.000
large effect or not depends on
the scales of both variables.
00:10:42.000 --> 00:10:43.980
And it also depends on the context.
00:10:43.980 --> 00:10:46.560
If you get a thousand more per year for
00:10:46.560 --> 00:10:51.390
each additional year of education
that's a big effect for somebody.
00:10:51.390 --> 00:10:56.340
And it's a small effect for another person
depending on where the person lives,
00:10:56.340 --> 00:10:58.290
and how much the person makes.
00:10:58.290 --> 00:11:01.860
So for all of these statistics, the interpretation
00:11:01.860 --> 00:11:06.630
requires context and they take aim at
the variance inflation factor as well.
00:11:06.630 --> 00:11:11.100
So the variance inflation factor quantifies
how much larger the variance would be
00:11:11.100 --> 00:11:16.710
compared to if there was no multicollinearity
whatsoever between the independent variables.
00:11:16.710 --> 00:11:24.990
And they say that if the standard errors from
your analysis are small, then who cares that they
00:11:24.990 --> 00:11:32.160
could be smaller if your independent
variables were completely uncorrelated.
00:11:32.160 --> 00:11:34.740
Which is an unrealistic scenario anyway.
00:11:34.740 --> 00:11:40.950
So if the standard errors indicate that
the estimates are precise, then who
00:11:40.950 --> 00:11:43.080
cares? They are precise, and that's what we care about.
00:11:43.080 --> 00:11:47.340
So the variance inflation factor doesn't
really tell us anything useful.
00:11:47.340 --> 00:11:51.810
On the other hand they also say that
in some scenarios the rule of thumb
00:11:51.810 --> 00:11:56.790
that variance inflation factor
must not exceed 10 is not enough.
00:11:56.790 --> 00:12:03.150
So in the previous example we saw that
there was a 0.9 correlation, corresponding
00:12:03.150 --> 00:12:11.430
to a variance inflation factor of 5.26,
which still made it a lot more
00:12:11.430 --> 00:12:15.780
difficult for us to identify which
one of those models was correct.
00:12:15.780 --> 00:12:19.830
So we had a collinearity issue, and it wasn't
detected by the variance inflation factor.
00:12:19.830 --> 00:12:26.940
So regarding the variance inflation factor,
as Ketokivi and Guide say, stating
00:12:26.940 --> 00:12:32.340
that it must not exceed a cut-off without
considering the context is nonsense.
00:12:32.340 --> 00:12:38.490
So that's what they say and I
agree with that statement fully.
00:12:39.330 --> 00:12:44.310
You have to always contextualize what does
a statistic mean in your particular study.
00:12:44.310 --> 00:12:51.030
Wooldridge also takes some shots at the variance
inflation factor and multicollinearity.
00:12:51.030 --> 00:12:57.300
So this is from the fourth edition of his
introductory econometrics textbook, and he didn't address
00:12:57.300 --> 00:13:02.160
multicollinearity in the first three
editions of his book because he thinks
00:13:02.160 --> 00:13:05.730
that it is not a useful concept
or it's not important enough.
00:13:05.730 --> 00:13:10.320
Regression analysis does not make any
assumptions about multicollinearity,
00:13:10.320 --> 00:13:15.630
it makes the assumption that each independent
variable should contribute unique information.
00:13:15.630 --> 00:13:22.840
So the variables cannot be perfectly correlated,
but it doesn't make any assumptions beyond that.
00:13:22.840 --> 00:13:26.920
He decided that he's going to take up this
00:13:26.920 --> 00:13:30.490
issue because there's so much bad
advice about multicollinearity.
00:13:31.930 --> 00:13:39.580
He says that these explanations of
multicollinearity are typically wrongheaded.
00:13:39.580 --> 00:13:42.280
People explain that it is a problem.
00:13:42.280 --> 00:13:45.310
And then if you have variance inflation factor
00:13:45.310 --> 00:13:50.680
more than 10 you have to drop variables
without really explaining the problem.
00:13:50.680 --> 00:13:55.510
Or what the consequences of
dropping variables from your model are.
00:13:55.510 --> 00:14:02.110
So let's now take a look at what
it means to solve a multicollinearity problem.
00:14:02.110 --> 00:14:08.260
So, to understand the multicollinearity
problem: multicollinearity is a problem
00:14:08.260 --> 00:14:14.620
in the same sense that a fever is a
disease. It is not really a problem per se,
00:14:14.620 --> 00:14:19.570
it is a symptom and you don't treat
the symptom you treat the disease.
00:14:19.570 --> 00:14:25.300
So if you have a child who has
fever, typically cooling down the
00:14:25.300 --> 00:14:29.050
child by putting them outside in the cold
is not the right treatment.
00:14:29.050 --> 00:14:31.330
You have to look at the
cause of the multicollinearity, the
00:14:31.330 --> 00:14:35.980
cause of the fever and fix the cause
instead of trying to fix the symptom.
00:14:35.980 --> 00:14:40.240
The typical solution for
multicollinearity problems,
00:14:40.240 --> 00:14:43.210
how do we make x1 and x2 less correlated?
00:14:43.210 --> 00:14:45.220
Well we just drop one from the model.
00:14:45.220 --> 00:14:52.930
So let's say we drop
x2 from the model.
00:14:52.930 --> 00:14:58.300
In the previous example, the correct model
was that both effects were 0.25.
00:14:58.300 --> 00:15:10.990
And now if we drop x2, then the estimate of x1
will reflect the influence of both x1 and x2.
00:15:10.990 --> 00:15:15.160
So what will happen is that we will overestimate the
00:15:15.160 --> 00:15:20.050
regression coefficient beta 1 by
90%, and the standard errors are smaller.
00:15:20.050 --> 00:15:28.570
So we will have a false sense of accuracy
related to this severely biased estimate.
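The 90% figure follows from the omitted-variable-bias formula: with standardized variables, the estimate of beta 1 converges to beta_1 + beta_2 * r12 when x2 is left out. A quick editor's check in Python using the example's values:

```python
beta1, beta2, r12 = 0.25, 0.25, 0.9

# When x2 is omitted, x1 picks up x2's effect
# through the 0.9 correlation between x1 and x2.
biased = beta1 + beta2 * r12
print(biased)                        # 0.475 instead of the true 0.25
print((biased - beta1) / beta1)      # relative overestimate: about 0.9, i.e. 90%
```

So "solving" the collinearity by dropping a variable trades an imprecise but unbiased estimate for a precise-looking but badly biased one.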
00:15:28.570 --> 00:15:34.720
And also if you have control variables
that are collinear with one another,
00:15:34.720 --> 00:15:38.290
that is irrelevant because typically we just want
00:15:38.290 --> 00:15:42.760
to know how much of the variation of the
dependent variable is explained jointly,
00:15:42.760 --> 00:15:45.580
by those controls; we're
not really interested in
00:15:45.580 --> 00:15:49.030
which one of the controls actually
explains the dependent variable.
00:15:50.680 --> 00:15:54.550
Collinearity between
00:15:54.550 --> 00:15:57.190
the interesting variables and
the controls is important.
00:15:57.190 --> 00:16:01.090
But if the collinearity is just among the
controls, then it doesn't matter.
00:16:01.090 --> 00:16:09.220
Okay so treating collinearity as a problem is
the same thing as treating fever as a disease.
00:16:09.220 --> 00:16:11.410
So it's not a smart thing to do.
00:16:11.410 --> 00:16:16.780
We have to understand what are the reasons
why two variables are so highly correlated
00:16:16.780 --> 00:16:21.520
that we can't really say which one is
the cause of the dependent variable.
00:16:21.520 --> 00:16:25.240
So there are a couple of
reasons why that could happen.
00:16:25.990 --> 00:16:28.930
Multicollinearity could be happening because you
00:16:28.930 --> 00:16:31.510
have mindlessly added a lot
of variables into the model.
00:16:31.510 --> 00:16:35.710
And you shouldn't be adding
variables to the model mindlessly.
00:16:35.710 --> 00:16:40.030
All variables that go into your
model must be based on theory.
00:16:40.030 --> 00:16:44.860
So just throwing a hundred variables into a
model typically doesn't make sense.
00:16:44.860 --> 00:16:50.320
Your models are built to test theory,
and so they must be driven by theory.
00:16:50.320 --> 00:16:56.530
So what you think has a causal
effect on the Y variable must
00:16:56.530 --> 00:17:00.580
go into the model, and you also must
be able to explain why: what's the
00:17:00.580 --> 00:17:05.620
mechanism by which the independent variable
influences the dependent variable causally.
00:17:05.620 --> 00:17:08.050
So that is one case.
You have just been
00:17:08.050 --> 00:17:13.840
mindlessly data mining, and that's a problem.
So multicollinearity is not the problem here,
00:17:13.840 --> 00:17:17.650
the problem is that you're making
stupid modeling decisions.
00:17:17.650 --> 00:17:22.990
The second problem is that you have distinct
constructs but their measures are highly
00:17:22.990 --> 00:17:31.120
correlated, and here the primary problem is not
multicollinearity but discriminant validity.
00:17:31.120 --> 00:17:38.020
So if two measures of things that are
supposed to be distinct are highly
00:17:38.020 --> 00:17:41.980
correlated, it's a problem of measurement validity.
00:17:41.980 --> 00:17:44.170
I'll address that in a later video.
00:17:44.170 --> 00:17:47.740
Then there is the case where you have two
measures of the same construct in the model.
00:17:47.740 --> 00:17:53.080
For example, if you are studying the
effect of company size, then you have
00:17:53.080 --> 00:17:56.890
revenue and personnel both as
measures of size in the model.
00:17:56.890 --> 00:18:01.390
That's not a good idea to have two
measures of the same thing in the model.
00:18:01.390 --> 00:18:06.460
Let's take an extreme example, let's
assume that we want to study the effect
00:18:06.460 --> 00:18:10.570
of person's height on person's weight
and we have two measures of height.
00:18:10.570 --> 00:18:12.250
We have centimeters and inches.
00:18:12.250 --> 00:18:18.670
It doesn't make any sense to try to
get the effect of inches independent
00:18:18.670 --> 00:18:21.400
of the effect of centimeters; in fact,
that can't even be estimated.
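This is easy to see in matrix terms: with height in both centimeters and inches, the design matrix loses a rank, so the normal equations have no unique solution. A small editor's sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(2)
cm = rng.normal(170.0, 10.0, size=50)   # heights in centimeters
inches = cm / 2.54                      # the same heights, perfectly collinear

# Design matrix with intercept, cm, and inches: rank 2, not 3,
# so separate effects for cm and inches cannot be estimated.
X = np.column_stack([np.ones(50), cm, inches])
print(np.linalg.matrix_rank(X))
```

This is the extreme end of the same continuum: at correlation 1.0 the unique effects are not merely imprecise, they are undefined.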
00:18:21.400 --> 00:18:29.020
So if you have multiple
measures of the same thing, then typically
00:18:29.770 --> 00:18:36.820
you should first combine those multiple
measures into a single composite measure.
00:18:36.820 --> 00:18:37.870
I'll cover that later on.
00:18:37.870 --> 00:18:44.020
Then the final case is that you are really
interested in two closely related constructs.
00:18:44.020 --> 00:18:45.370
And their distinct effects.
00:18:46.510 --> 00:18:48.970
For example you want to know whether a person's
00:18:48.970 --> 00:18:55.360
age or a person's tenure influences
the customer satisfaction scores
00:18:55.360 --> 00:18:59.620
That the patients give to the
doctors, like in Hekman's study.
00:18:59.620 --> 00:19:03.640
Then you really cannot drop either one of those.
00:19:03.640 --> 00:19:09.400
You can't say that, because tenure
and age are highly correlated,
00:19:09.400 --> 00:19:15.430
we are just going to omit tenure and
assume that all the correlation between
00:19:15.430 --> 00:19:21.970
age and customer satisfaction is due to the
age only and tenure doesn't have an effect.
00:19:21.970 --> 00:19:27.820
So that is not the right choice.
Instead you have to just increase the sample size.
00:19:27.820 --> 00:19:34.480
So that you can answer your complex
research question in a precise manner.