WEBVTT
Kind: captions
Language: en

00:00:01.617 --> 00:00:04.830
In this video, I will introduce 
you to the important concept

00:00:04.830 --> 00:00:06.995
that the linear model implies 
a correlation matrix.

00:00:07.246 --> 00:00:09.870
This is something that you will typically run into

00:00:09.870 --> 00:00:11.190
in more advanced texts.

00:00:11.190 --> 00:00:14.400
But I think it's a very useful 
principle to understand,

00:00:14.400 --> 00:00:17.010
even in a first course on quantitative analysis.

00:00:17.010 --> 00:00:19.890
So a linear model is any model,

00:00:19.890 --> 00:00:21.750
where all the relationships are linear.

00:00:21.750 --> 00:00:27.966
For example, the regression model. A correlation 
matrix quantifies the linear associations

00:00:27.966 --> 00:00:31.830
between variables, two variables 
at a time, on a standardized metric.

00:00:32.230 --> 00:00:35.350
So, what does it mean that the linear 
model implies the correlation matrix?

00:00:35.350 --> 00:00:37.140
Let's take a look at this path,

00:00:37.140 --> 00:00:39.870
this regression model in the path diagram form.

00:00:39.870 --> 00:00:44.641
So we have three different 
independent variables: x1, x2 and x3,

00:00:44.641 --> 00:00:50.510
linked to dependent variable y with regression 
coefficients for these regression paths.

00:00:50.736 --> 00:00:53.346
Then we have some variation u,

00:00:53.346 --> 00:00:55.866
the error term that the model doesn't explain,

00:00:55.866 --> 00:00:59.940
and then we have these x's that are 
allowed to be freely correlated,

00:00:59.940 --> 00:01:03.570
the correlations are shown by 
these two-headed curved arrows.

00:01:05.190 --> 00:01:12.990
This principle says that the 
correlations between the x variables are

00:01:12.990 --> 00:01:14.430
what the data gives us.

00:01:14.430 --> 00:01:17.160
So we can just calculate the 
correlation between x1 and x2,

00:01:17.160 --> 00:01:19.620
and that is taken as it is.

00:01:20.096 --> 00:01:23.426
Then we say that the correlations are free.

00:01:23.426 --> 00:01:27.361
But the correlations involving 
y depend on the model.

00:01:27.361 --> 00:01:35.061
So we can say that the correlation between 
x1 and y depends on these correlations,

00:01:35.061 --> 00:01:36.810
and the model parameters here,

00:01:36.810 --> 00:01:38.520
so it's implied by the model.

00:01:38.971 --> 00:01:42.266
What that means is that in practice

00:01:42.266 --> 00:01:45.850
we start from x1 and we trace paths.

00:01:45.850 --> 00:01:51.060
So we can check how we get from 
x1 to y in different ways,

00:01:51.060 --> 00:01:54.870
and then we trace all possible paths,

00:01:54.870 --> 00:01:56.580
we take a sum of those paths,

00:01:56.580 --> 00:02:01.920
and then that will give us the 
correlation between x1 and y.

00:02:01.920 --> 00:02:03.510
Let's take a look at an example.

00:02:03.510 --> 00:02:07.230
This is an important concept because 
if you understand this concept,

00:02:07.230 --> 00:02:12.480
it'll allow you to understand certain 
properties of regression analysis

00:02:12.480 --> 00:02:15.000
at a much deeper level than you otherwise would,

00:02:15.000 --> 00:02:17.820
and it's also very useful,

00:02:17.820 --> 00:02:22.530
when you think of factor analysis 
or structural equation models,

00:02:22.530 --> 00:02:24.720
or other more complicated models.

00:02:24.720 --> 00:02:26.790
Let's do the tracing.

00:02:26.790 --> 00:02:30.690
So the idea of path analysis tracing rules is

00:02:30.690 --> 00:02:35.130
that we pick two variables,

00:02:35.130 --> 00:02:37.950
if we want to calculate the 
correlation between two variables,

00:02:37.950 --> 00:02:39.711
we pick x1 and y,

00:02:39.711 --> 00:02:44.086
then we check in how many different 
ways we can get from x1 to y,

00:02:44.086 --> 00:02:48.450
and we can only go along these arrows down,

00:02:48.450 --> 00:02:53.190
or we can travel up and 
then along one curved arrow

00:02:53.190 --> 00:02:55.530
and then back down again.

00:02:55.530 --> 00:02:59.400
So from x1, we can get to 
y in three different ways,

00:02:59.400 --> 00:03:02.310
we can go along the direct regression path here,

00:03:02.310 --> 00:03:07.800
we can go from x1 along one correlation to x2,

00:03:07.800 --> 00:03:11.280
we can't take another correlation from there, 
because we can only take one correlation,

00:03:11.280 --> 00:03:17.680
and then down to y; and finally we go from x1 to x3 and down to y.

00:03:17.680 --> 00:03:21.610
And that's all three paths 
that we can take from x1 to y.

00:03:22.211 --> 00:03:27.870
So this gives us the following equation:

00:03:27.870 --> 00:03:30.840
So we can check that the 
correlation between x1 and y

00:03:30.840 --> 00:03:36.720
is the sum of the direct path, plus the 
correlation between x1 and x2 times

00:03:36.720 --> 00:03:43.860
the direct path from x2, plus the correlation 
between x1 and x3 times the direct path from x3,

00:03:43.860 --> 00:03:45.005
so three paths in total.
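
NOTE
A minimal numeric sketch of this tracing rule in Python (not part of the lecture; the correlations and coefficients below are made-up illustrative values):
    # made-up standardized values, for illustration only
    r12, r13, r23 = 0.30, 0.20, 0.10   # correlations among x1, x2, x3
    b1, b2, b3 = 0.40, 0.25, 0.15      # standardized regression coefficients
    # tracing rule: direct path, plus one-curve detours through x2 and x3
    cor_x1_y = b1 + r12 * b2 + r13 * b3
    print(cor_x1_y)                    # 0.505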

00:03:45.005 --> 00:03:49.480
What's the interpretation of this 
equation for the correlation?

00:03:49.480 --> 00:03:54.840
It is that the correlation between 
x1 and y equals the direct effect

00:03:54.840 --> 00:03:57.870
plus any spurious effects,

00:03:57.870 --> 00:04:01.350
because x1 is correlated with x2 and x3

00:04:01.350 --> 00:04:06.660
that both have effects on y.

00:04:06.935 --> 00:04:09.995
So we are saying that this 
correlation here

00:04:09.995 --> 00:04:17.880
comes from this relationship of 
interest plus these spurious other causes,

00:04:17.880 --> 00:04:19.305
or common causes of y,

00:04:19.305 --> 00:04:20.550
that correlate with x1.

00:04:20.800 --> 00:04:22.720
So that's the idea.

00:04:22.720 --> 00:04:25.620
So we get these three paths,

00:04:25.620 --> 00:04:28.560
we multiply everything along each path,

00:04:28.560 --> 00:04:31.020
and then we take the sum of these paths.

00:04:31.020 --> 00:04:37.350
So here, the path from x1 through x2 to y 
includes the correlation here,

00:04:37.350 --> 00:04:40.650
and the regression path here.

00:04:40.650 --> 00:04:42.870
So we multiply those to get the value of the path,

00:04:42.870 --> 00:04:45.660
and we sum all the paths, which 
gives us the correlation.

00:04:45.660 --> 00:04:50.340
The importance of this rule will 
be made clear in a few slides.

00:04:50.340 --> 00:04:53.130
So that gives us the correlations

00:04:53.130 --> 00:04:56.580
but we also need the variances of variables,

00:04:56.580 --> 00:04:58.710
so those are implied by the model as well.

00:04:58.710 --> 00:05:01.560
Now we are working on a correlation metric,

00:05:01.560 --> 00:05:03.695
which means the variance of y is 1.

00:05:03.695 --> 00:05:06.960
But that 1 is something that 
the model implies as well.

00:05:07.436 --> 00:05:10.856
So when we have the variance of y,

00:05:10.856 --> 00:05:12.000
we have to think,

00:05:12.000 --> 00:05:16.710
how many different ways we 
can go from y to somewhere and

00:05:16.710 --> 00:05:17.811
then come back.

00:05:17.811 --> 00:05:19.770
So we can go to the error term,

00:05:19.770 --> 00:05:22.080
we can go up once, then we turn back.

00:05:22.080 --> 00:05:28.560
So that is the variance of the error 
term times 1 and times 1 again,

00:05:28.560 --> 00:05:29.730
because we go back and forth.

00:05:30.181 --> 00:05:33.211
Then we have y to x1,

00:05:33.211 --> 00:05:37.170
the variance of x1 is 1 because we 
are working with standardized data,

00:05:37.170 --> 00:05:38.310
and we come back.

00:05:38.310 --> 00:05:43.320
So we have beta1 times 1, 
times beta1 on the way back,

00:05:43.320 --> 00:05:44.520
so beta1 squared.

00:05:44.520 --> 00:05:48.480
The same for x2 and back, and x3 and back.

00:05:48.705 --> 00:05:52.845
Then we have a way of going from y to x1,

00:05:52.845 --> 00:05:55.250
then one correlation to x2,

00:05:55.250 --> 00:05:57.000
and back to y.

00:05:57.000 --> 00:06:02.040
So that will be beta1 times 
the correlation, times beta2,

00:06:02.040 --> 00:06:05.310
and we can take the same path 
in the opposite direction,

00:06:05.310 --> 00:06:09.690
x2 correlation and back, so we get that.

00:06:09.690 --> 00:06:13.115
And that gives us the following equation.

00:06:13.315 --> 00:06:15.385
So we have the direct effects,

00:06:15.560 --> 00:06:19.290
beta1 squared plus beta2 
squared plus beta3 squared,

00:06:19.290 --> 00:06:22.920
so we go from x1 and back, 
x2 and back, x3 and back,

00:06:22.920 --> 00:06:25.140
and because we go back and forth,

00:06:25.140 --> 00:06:27.450
we have beta1 twice or beta1 squared,

00:06:27.450 --> 00:06:28.860
because we multiply things together.

00:06:29.636 --> 00:06:34.020
Then we have the correlational paths:

00:06:34.020 --> 00:06:37.236
we go x1 and x2, and then back,

00:06:37.236 --> 00:06:40.020
and we go x2 and x1, and then back.

00:06:40.020 --> 00:06:42.150
So that's multiplied by 2,

00:06:42.150 --> 00:06:45.750
we do that for each pair of variables,

00:06:45.750 --> 00:06:47.250
and then we have the variance of the error term.

00:06:47.250 --> 00:06:48.960
So that gives us the variance of y,

00:06:48.960 --> 00:06:53.070
which in a correlation matrix is always 1.
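
NOTE
A matching sketch for the implied variance of y (again with made-up values, not from the lecture), where the error variance is whatever is left over on the standardized metric:
    r12, r13, r23 = 0.30, 0.20, 0.10
    b1, b2, b3 = 0.40, 0.25, 0.15
    explained = (b1**2 + b2**2 + b3**2                           # x1, x2, x3 and back
                 + 2*b1*r12*b2 + 2*b1*r13*b3 + 2*b2*r23*b3)      # doubled correlational paths
    var_u = 1 - explained          # error variance on the standardized metric
    var_y = explained + var_u      # model-implied variance of y
    print(var_u, var_y)            # 0.6635 1.0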

00:06:54.472 --> 00:06:58.650
So we can use these rules to calculate the full  

00:06:58.650 --> 00:07:01.830
correlation matrix between 
all variables in our data.

00:07:01.830 --> 00:07:05.160
So here we have the x variables,

00:07:05.160 --> 00:07:08.790
the variances of all the x's are one,

00:07:08.790 --> 00:07:11.686
because we are working on correlations.

00:07:11.686 --> 00:07:16.290
And then the correlation between x's 
is something that is given in our data.

00:07:16.290 --> 00:07:20.370
And then we have these equations 
for correlation between y and x1,

00:07:20.370 --> 00:07:22.620
y and x2, y and x3,

00:07:22.620 --> 00:07:24.806
and then the variance of y,

00:07:24.806 --> 00:07:29.006
which is the covariance of 
the variable with itself,

00:07:29.006 --> 00:07:33.750
so we write that equation there, and this is the 
model-implied variance expression, not the actual value.
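
NOTE
The same bookkeeping can be written compactly in matrix form; a sketch with the same made-up values (Rxx is the correlation matrix of the x's, beta the vector of standardized coefficients):
    import numpy as np
    r12, r13, r23 = 0.30, 0.20, 0.10
    b1, b2, b3 = 0.40, 0.25, 0.15
    Rxx = np.array([[1.0, r12, r13],
                    [r12, 1.0, r23],
                    [r13, r23, 1.0]])
    beta = np.array([b1, b2, b3])
    r_xy = Rxx @ beta                   # implied correlations of y with each x
    var_u = 1 - beta @ Rxx @ beta       # implied error variance
    var_y = beta @ Rxx @ beta + var_u   # implied variance of y, equals 1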

00:07:34.751 --> 00:07:39.030
So, why would this kind of 
model or principle be useful?

00:07:39.631 --> 00:07:41.461
The reason is that,

00:07:41.736 --> 00:07:46.081
if we know this correlation matrix from the data,

00:07:46.081 --> 00:07:51.030
then we can actually calculate 
the regression effect estimates.

00:07:51.405 --> 00:07:53.295
So we can also work backwards,

00:07:53.371 --> 00:07:56.011
so we know those correlations in the data,

00:07:56.036 --> 00:07:58.076
and then we can find out

00:07:58.076 --> 00:08:01.470
what set of regression coefficients 
beta1, beta2 and beta3,

00:08:01.470 --> 00:08:03.480
and the variance of the error term,

00:08:03.480 --> 00:08:06.330
would be compatible with this correlation matrix.

00:08:06.330 --> 00:08:10.320
So we can find the model 
parameters beta1, beta2, beta3,

00:08:10.320 --> 00:08:13.110
and variance of u, the error term,

00:08:13.110 --> 00:08:17.130
that produce this implied correlation matrix.
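
NOTE
A sketch of that backwards step in Python: given the observed correlations among the x's and between the x's and y, the compatible coefficients and error variance follow directly (the numbers are illustrative, not Hekman's):
    import numpy as np
    Rxx = np.array([[1.00, 0.30, 0.20],
                    [0.30, 1.00, 0.10],
                    [0.20, 0.10, 1.00]])       # correlations among the x's
    r_xy = np.array([0.505, 0.385, 0.255])     # observed correlations with y
    beta = np.linalg.solve(Rxx, r_xy)          # standardized coefficients
    var_u = 1 - r_xy @ beta                    # implied error variance
    print(beta, var_u)                         # [0.40 0.25 0.15] 0.6635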

00:08:17.881 --> 00:08:19.410
So let's do that.

00:08:19.410 --> 00:08:23.850
Hekman's paper gives us a correlation 
matrix of all the variables.

00:08:23.850 --> 00:08:28.980
So they give the correlations for the variables

00:08:28.980 --> 00:08:30.930
before forming the interaction terms.

00:08:30.930 --> 00:08:36.690
So we can calculate this part of the Model 1 here

00:08:36.690 --> 00:08:39.150
using the correlations.

00:08:39.150 --> 00:08:43.101
We get estimates that are 
very close to one another.

00:08:43.101 --> 00:08:49.290
So we can see that this is -0.23 and this is -0.23,

00:08:49.290 --> 00:08:51.300
so they're mostly the same.

00:08:51.300 --> 00:08:57.381
There is some imprecision because these 
estimates are reported to just two digits of precision,

00:08:57.381 --> 00:08:59.490
and the correlations are two-digit precision,

00:08:59.490 --> 00:09:01.111
so we have some rounding errors.

00:09:01.111 --> 00:09:05.310
And also we have these interaction 
terms here in their model,

00:09:05.310 --> 00:09:06.780
that we don't have in this model,

00:09:06.780 --> 00:09:10.440
because they didn't present the 
correlations between these interactions

00:09:10.440 --> 00:09:11.340
and other variables.

00:09:11.340 --> 00:09:14.026
But the results are mostly the same.

00:09:14.026 --> 00:09:16.170
There is one important question now.

00:09:16.170 --> 00:09:19.080
If we look at the p-values,

00:09:19.681 --> 00:09:22.170
the p-values, or rather, they don't present the p-values,

00:09:22.170 --> 00:09:24.360
but they present the stars.

00:09:24.360 --> 00:09:28.290
So we tend to have fewer stars 
than they have in the paper.

00:09:29.266 --> 00:09:31.710
So it's an important question 
when we replicate something,

00:09:31.710 --> 00:09:34.560
and we don't get the same result,

00:09:34.560 --> 00:09:35.340
why is that the case?

00:09:37.467 --> 00:09:39.150
Starting to understand

00:09:39.150 --> 00:09:46.710
why the p-values from our replication are 
different from Hekman's paper is useful,

00:09:46.710 --> 00:09:50.280
because it teaches you something 
about statistical analysis.

00:09:50.280 --> 00:09:56.490
So remember that the p-value is defined 
by the standard error, the estimate,

00:09:56.490 --> 00:09:59.670
and the reference distribution against

00:09:59.670 --> 00:10:00.930
which we compare the T statistic,

00:10:00.930 --> 00:10:04.240
which is the ratio of the 
estimate to the standard error.

00:10:04.240 --> 00:10:09.660
The estimates here are about 
the same as the estimates here.

00:10:10.611 --> 00:10:15.661
So, what could be different 
is the standard errors,

00:10:15.661 --> 00:10:18.930
perhaps we calculated the standard 
errors differently than they did,

00:10:18.930 --> 00:10:24.610
for example, because we don't 
include these variables in the model,

00:10:24.736 --> 00:10:29.836
it is possible that our 
standard errors are larger.

00:10:29.836 --> 00:10:32.870
That's an unlikely explanation, but it's possible.

00:10:32.870 --> 00:10:35.910
And if our standard errors are 
larger than in Hekman's paper,

00:10:35.910 --> 00:10:39.750
then that would lead to the p-value differences.

00:10:39.750 --> 00:10:44.700
So let's check if that is a plausible 
explanation for the differences.

00:10:44.700 --> 00:10:49.170
To understand, if that's a plausible explanation,

00:10:49.170 --> 00:10:53.400
we have to consider where the standard 
errors come from in a regression analysis.

00:10:53.400 --> 00:10:58.440
One way to calculate the standard errors 
is an equation that looks like that.
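
NOTE
The equation itself is on the slide and not in the captions; a common textbook form of it, as a hedged reconstruction, is sketched below (the function name and arguments are just illustrative; sigma2 is the residual variance, sst_j the sum of squared deviations of x_j, and r2_j the R-squared of x_j regressed on the other independent variables):
    def se_beta_j(sigma2, sst_j, r2_j):
        # Var(beta_j) = sigma^2 / (SST_j * (1 - R_j^2)); the SE is its square root
        return (sigma2 / (sst_j * (1 - r2_j))) ** 0.5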

00:10:58.966 --> 00:11:04.080
And remember that we calculate the p-value by

00:11:04.080 --> 00:11:10.236
comparing the estimate divided by 
standard error against the t distribution.
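
NOTE
A sketch of that p-value calculation with scipy (the estimate, standard error, and degrees of freedom are made-up illustrative values):
    from scipy import stats
    estimate, se, df = -0.23, 0.10, 200
    t_stat = estimate / se                          # ratio of estimate to standard error
    p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value
    print(t_stat, p_two_tailed)                     # -2.3, about 0.02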

00:11:10.840 --> 00:11:13.660
So could our standard errors be different,

00:11:13.660 --> 00:11:19.441
that is, are the values that we use 
different from Hekman's paper?

00:11:19.441 --> 00:11:24.790
The first thing that we notice 
is that the R-squared here,

00:11:24.790 --> 00:11:29.031
the R-squared in the formula refers to

00:11:29.031 --> 00:11:35.380
R-squared of one independent variable 
on every other variable in the model.

00:11:35.380 --> 00:11:40.550
So we calculate the standard 
error for one variable by

00:11:40.550 --> 00:11:47.620
calculating the R-squared of that variable 
on every other independent variable.

00:11:47.620 --> 00:11:51.040
So that R-squared J tells us

00:11:51.040 --> 00:11:53.980
what is unique in one independent variable,

00:11:53.980 --> 00:11:55.780
compared to other independent variables.

00:11:55.780 --> 00:12:03.070
This term here has some additional meanings 
that I will explain in a video a bit later.

00:12:03.871 --> 00:12:05.800
So if we omit variables,

00:12:05.800 --> 00:12:10.060
so Hekman's study had 15 independent 
variables in the first model,

00:12:10.060 --> 00:12:12.340
because they had three interaction terms,

00:12:12.340 --> 00:12:13.851
we only have 12.

00:12:13.851 --> 00:12:17.500
And we know that if we add variables to a model,

00:12:17.500 --> 00:12:19.090
then R-squared cannot decrease.

00:12:19.090 --> 00:12:23.620
So R-squared can only increase,

00:12:23.620 --> 00:12:28.870
so our R-squared should be a bit 
smaller than Hekman's R-squared,

00:12:28.870 --> 00:12:31.540
because we have less variables in the model,

00:12:31.540 --> 00:12:32.860
we don't have the interactions.

00:12:33.636 --> 00:12:38.260
If this R-squared J decreases,

00:12:38.260 --> 00:12:44.650
then 1 minus R-squared, this 
subtraction result increases,

00:12:44.650 --> 00:12:47.740
and this causes the denominator to increase here.

00:12:48.216 --> 00:12:51.700
So we have a larger denominator here,

00:12:51.700 --> 00:12:57.790
which basically means that the 
standard errors will be smaller,

00:13:00.339 --> 00:13:02.350
just based on that consideration.

00:13:02.350 --> 00:13:04.240
So if our standard errors are smaller,

00:13:04.240 --> 00:13:07.540
then we know that our p-values 
should be smaller as well,

00:13:07.540 --> 00:13:11.800
because the estimate divided by 
standard error will be larger

00:13:11.800 --> 00:13:13.125
when a standard error gets smaller,

00:13:13.125 --> 00:13:15.015
and it will be further from 0,

00:13:15.015 --> 00:13:16.900
which means smaller p-value.

00:13:17.026 --> 00:13:22.336
Then, what about the part here on the top?

00:13:22.336 --> 00:13:25.576
that's the variance of the error term.

00:13:25.576 --> 00:13:29.200
In Hekman's paper it is 0.75,

00:13:29.200 --> 00:13:35.110
so it's 1 minus R-squared, 
the variance of the error term,

00:13:35.110 --> 00:13:36.400
in standardized results.

00:13:36.400 --> 00:13:42.130
So Hekman's is 0.75,

00:13:42.130 --> 00:13:43.810
ours is 0.78,

00:13:43.810 --> 00:13:46.235
so there's a 3 percentage point difference.

00:13:46.235 --> 00:13:53.526
So because we can expect this denominator 
to pull the standard error down a bit,

00:13:53.526 --> 00:14:00.010
and the numerator to push it up a bit,

00:14:00.010 --> 00:14:07.330
then we are expecting the standard errors

00:14:07.330 --> 00:14:09.250
to be perhaps about the same.

00:14:10.151 --> 00:14:12.610
So we can't really look at the standard errors

00:14:12.610 --> 00:14:15.520
and conclude that there is 
a clear reason to believe

00:14:15.520 --> 00:14:18.340
that our standard errors will 
be substantially different.

00:14:18.340 --> 00:14:19.900
So we conclude that,

00:14:19.900 --> 00:14:22.900
based on looking at where the 
standard errors come from,

00:14:22.900 --> 00:14:26.500
we can't see a clear reason

00:14:26.500 --> 00:14:30.550
why our standard errors would be 
larger than in Hekman's paper.

00:14:31.201 --> 00:14:33.361
So that's an unlikely explanation.

00:14:33.361 --> 00:14:36.271
So, why do the p-values then differ?

00:14:36.271 --> 00:14:38.441
If we have the same estimates,

00:14:38.441 --> 00:14:42.836
and we have no reason to believe that 
the standard errors differ substantially,

00:14:42.836 --> 00:14:47.890
then what remains as a 
plausible explanation is that,

00:14:47.890 --> 00:14:52.660
we are comparing this estimate 
divided by the standard error,

00:14:52.660 --> 00:14:55.990
the T statistic against a 
different distribution than Hekman,

00:14:55.990 --> 00:14:58.150
and that will produce different p-values.

00:15:00.000 --> 00:15:07.786
So if we divide our p-values by 2,

00:15:08.812 --> 00:15:12.790
we can actually get the 
same stars as Hekman mostly.

00:15:12.790 --> 00:15:14.680
So that's an interesting observation.

00:15:14.680 --> 00:15:18.130
Our p-values appear to be twice as large

00:15:18.130 --> 00:15:20.170
as the p-values by Hekman.

00:15:20.796 --> 00:15:22.956
Why would that be the case?

00:15:24.000 --> 00:15:28.780
Well this is an indication 
of Hekman actually using

00:15:29.105 --> 00:15:33.065
one-tailed tests instead of two-tailed tests.

00:15:33.065 --> 00:15:36.400
So the difference in one 
and two-tailed tests is that

00:15:36.400 --> 00:15:39.880
in a one-tailed test, you only 
look at one tail of the distribution,

00:15:39.880 --> 00:15:42.700
so you will get the same significance level

00:15:42.700 --> 00:15:46.060
with the smaller value of the test statistics.

00:15:46.060 --> 00:15:52.540
So here you have a value of 1.7 
required for the 5% significance level,

00:15:52.540 --> 00:15:54.485
and here with the two-tailed test,

00:15:54.485 --> 00:15:58.866
because the two tail areas must sum to 5%,

00:15:58.866 --> 00:16:01.960
we have about 2 for the same problem.

00:16:01.960 --> 00:16:07.210
So with a one-tailed test, you basically 
take the p-value of a two-tailed test

00:16:07.210 --> 00:16:09.665
and you divide it by two.
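
NOTE
A sketch of that halving in Python, using the 1.7 cutoff mentioned above (df is an assumed residual degrees of freedom):
    from scipy import stats
    t_stat, df = 1.7, 200
    p_one_tailed = stats.t.sf(t_stat, df)            # area in one tail only
    p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)   # both tails
    print(p_one_tailed, p_two_tailed)                # about 0.045 vs 0.09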

00:16:10.141 --> 00:16:14.985
Because it is a convention 
to use two-tailed tests,

00:16:14.985 --> 00:16:21.470
doing one-tailed tests and 
not reporting that you did so

00:16:21.470 --> 00:16:26.320
is basically the same as claiming 
that you did two-tailed tests,

00:16:26.320 --> 00:16:28.465
and that's a bit unethical.

00:16:28.465 --> 00:16:32.650
Generally, there are very few good reasons,

00:16:32.650 --> 00:16:34.330
I can't name any good reasons,

00:16:34.330 --> 00:16:35.770
for using one-tailed tests.

00:16:35.770 --> 00:16:42.160
And for example, Abelson's book on 
statistical arguments explicitly says that

00:16:42.160 --> 00:16:47.590
using one-tailed tests instead of 
two-tailed tests is practically cheating.

00:16:48.291 --> 00:16:50.631
What's interesting is that,

00:16:50.631 --> 00:16:53.140
when Hekman's paper was under review,

00:16:53.140 --> 00:16:57.851
(he has published the full 
revision history of his paper)

00:16:57.851 --> 00:17:01.726
they included a mention that 
they used a one-tailed test,

00:17:01.726 --> 00:17:05.080
and you can see many papers 
actually do use a one-tailed test

00:17:05.080 --> 00:17:06.940
without really justifying that choice.

00:17:07.140 --> 00:17:09.180
So the choice is unjustified,

00:17:09.180 --> 00:17:10.810
but they nevertheless want to do it,

00:17:10.810 --> 00:17:13.450
presumably, because it makes the p-value smaller,

00:17:13.450 --> 00:17:14.865
and the results look better.

00:17:15.091 --> 00:17:16.830
But they mentioned,

00:17:16.830 --> 00:17:19.210
which is the right thing to do,

00:17:19.210 --> 00:17:20.865
that the p-values are one-tailed.

00:17:20.865 --> 00:17:25.450
For some reason, that part of 
the regression table footer

00:17:25.450 --> 00:17:27.100
was eliminated from the published version.

00:17:28.201 --> 00:17:31.171
So the rule of thumb is:

00:17:31.171 --> 00:17:32.950
don't use one-tailed tests.

00:17:32.950 --> 00:17:36.481
There is really no good reason 
for using one-tailed tests

00:17:36.481 --> 00:17:39.970
and if you do, report it clearly, 
but you really shouldn't.