WEBVTT
Kind: captions
Language: en
00:00:01.617 --> 00:00:04.830
In this video, I will introduce
you to the important concept
00:00:04.830 --> 00:00:06.995
that the linear model implies
a correlation matrix.
00:00:07.246 --> 00:00:09.870
This is something that you will typically run into
00:00:09.870 --> 00:00:11.190
in more advanced texts.
00:00:11.190 --> 00:00:14.400
But I think it's a very useful
principle to understand,
00:00:14.400 --> 00:00:17.010
even on the first course on quantitative analysis.
00:00:17.010 --> 00:00:19.890
So a linear model is any model
00:00:19.890 --> 00:00:21.750
where all the relationships are linear.
00:00:21.750 --> 00:00:27.966
For example, the regression model. A correlation
matrix quantifies the linear associations
00:00:27.966 --> 00:00:31.830
between variables, two variables
at a time, on a standardized metric.
00:00:32.230 --> 00:00:35.350
So, what does it mean that the linear
model implies the correlation matrix?
00:00:35.350 --> 00:00:37.140
Let's take a look at this path,
00:00:37.140 --> 00:00:39.870
this regression model in the path diagram form.
00:00:39.870 --> 00:00:44.641
So we have three different
independent variables: x1, x2 and x3,
00:00:44.641 --> 00:00:50.510
linked to dependent variable y with regression
coefficients for these regression paths.
00:00:50.736 --> 00:00:53.346
Then we have some variation u,
00:00:53.346 --> 00:00:55.866
the error term that the model doesn't explain,
00:00:55.866 --> 00:00:59.940
and then we have these x's that are
allowed to be freely correlated,
00:00:59.940 --> 00:01:03.570
the correlation is shown by
these two-headed curved arrows.
00:01:05.190 --> 00:01:12.990
This principle says that the
correlations between the x variables are
00:01:12.990 --> 00:01:14.430
whatever the data gives us.
00:01:14.430 --> 00:01:17.160
So we can just calculate the
correlation between x1 and x2,
00:01:17.160 --> 00:01:19.620
and that is taken as it is.
00:01:20.096 --> 00:01:23.426
Then we say that the correlations are free.
00:01:23.426 --> 00:01:27.361
But the correlation involving
y depends on the model.
00:01:27.361 --> 00:01:35.061
So we can say that the correlation between
x1 and y depends on these correlations,
00:01:35.061 --> 00:01:36.810
and the model parameters here,
00:01:36.810 --> 00:01:38.520
so it's implied by the model.
00:01:38.971 --> 00:01:42.266
What that means is that in practice
00:01:42.266 --> 00:01:45.850
we start from x1 and we trace paths.
00:01:45.850 --> 00:01:51.060
So we can check in how many
different ways we can get from x1 to y,
00:01:51.060 --> 00:01:54.870
and then we trace all possible paths,
00:01:54.870 --> 00:01:56.580
we take a sum of those paths,
00:01:56.580 --> 00:02:01.920
and that sum gives us
the correlation between x1 and y.
00:02:01.920 --> 00:02:03.510
Let's take a look at an example.
00:02:03.510 --> 00:02:07.230
This is an important concept because
if you understand this concept,
00:02:07.230 --> 00:02:12.480
it'll allow you to understand certain
properties of regression analysis
00:02:12.480 --> 00:02:15.000
at a much deeper level than you otherwise would,
00:02:15.000 --> 00:02:17.820
and it's also very useful,
00:02:17.820 --> 00:02:22.530
when you think of factor analysis
or structural equation models,
00:02:22.530 --> 00:02:24.720
or other more complicated models.
00:02:24.720 --> 00:02:26.790
Let's do the tracing.
00:02:26.790 --> 00:02:30.690
So the idea of path analysis tracing rules is
00:02:30.690 --> 00:02:35.130
that we pick two variables,
00:02:35.130 --> 00:02:37.950
if we want to calculate the
correlation between two variables,
00:02:37.950 --> 00:02:39.711
we pick x1 and y,
00:02:39.711 --> 00:02:44.086
then we check, in how many different
ways we can get from x1 to y,
00:02:44.086 --> 00:02:48.450
and we can only go along these arrows down,
00:02:48.450 --> 00:02:53.190
or we can travel up and
then along one curved arrow
00:02:53.190 --> 00:02:55.530
and then back down again.
00:02:55.530 --> 00:02:59.400
So from x1, we can get to
y in three different ways,
00:02:59.400 --> 00:03:02.310
we can go along the direct regression path here,
00:03:02.310 --> 00:03:07.800
we can go from x1 along one correlation to x2
00:03:07.800 --> 00:03:11.280
and then down to y; we can't continue further,
because we can only take one correlation,
00:03:11.280 --> 00:03:17.680
and finally we go from x1 to x3 and down to y.
00:03:17.680 --> 00:03:21.610
And that's all three paths
that we can take from x1 to y.
00:03:22.211 --> 00:03:27.870
So this gives us the following equation:
00:03:27.870 --> 00:03:30.840
So we can see that the
correlation between x1 and y
00:03:30.840 --> 00:03:36.720
is the sum of the direct path, plus the
correlation between x1 and x2 times
00:03:36.720 --> 00:03:43.860
the direct path from x2, plus the correlation
between x1 and x3 times the direct path from x3,
00:03:43.860 --> 00:03:45.005
making three paths in total.
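The tracing just described can be checked numerically. Below is a minimal sketch with made-up standardized coefficients and x-correlations; the values of beta and R_xx are illustrative, not taken from any paper discussed here:

```python
import numpy as np

# Illustrative standardized coefficients and x-correlations (made-up values)
beta = np.array([0.4, 0.3, 0.2])           # beta1, beta2, beta3
R_xx = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])         # correlations among x1, x2, x3

# Tracing rule for cor(x1, y): the direct path plus the one-correlation detours
r_x1_y = beta[0] + R_xx[0, 1] * beta[1] + R_xx[0, 2] * beta[2]
print(r_x1_y)   # 0.4 + 0.5*0.3 + 0.3*0.2 = 0.61

# Doing the same sum for every x at once is just the matrix product R_xx @ beta
r_xy = R_xx @ beta
print(r_xy)
```

The matrix-product form shows why the principle generalizes: each implied x-y correlation collects all the traced paths in one row of the product.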
00:03:45.005 --> 00:03:49.480
What's the interpretation of this
equation for the correlation?
00:03:49.480 --> 00:03:54.840
It is that the correlation between
x1 and y equals the direct effect
00:03:54.840 --> 00:03:57.870
plus any spurious effects,
00:03:57.870 --> 00:04:01.350
because x1 is correlated with x2 and x3
00:04:01.350 --> 00:04:06.660
that both have effects on y.
00:04:06.935 --> 00:04:09.995
So we are saying that this
correlation actually here,
00:04:09.995 --> 00:04:17.880
is the result of this relationship of
interest plus these spurious other causes,
00:04:17.880 --> 00:04:19.305
or common causes of y,
00:04:19.305 --> 00:04:20.550
that correlate with x1.
00:04:20.800 --> 00:04:22.720
So that's the idea.
00:04:22.720 --> 00:04:25.620
So we get these three paths,
00:04:25.620 --> 00:04:28.560
we multiply everything along each path,
00:04:28.560 --> 00:04:31.020
and then we take the sum of these paths.
00:04:31.020 --> 00:04:37.350
So here, the path from x1 to x2
includes the correlation here,
00:04:37.350 --> 00:04:40.650
and includes the regression paths here.
00:04:40.650 --> 00:04:42.870
So we multiply those to get the value of the path,
00:04:42.870 --> 00:04:45.660
and we sum all the paths, which
gives us the correlation.
00:04:45.660 --> 00:04:50.340
The importance of this rule will
be made clear in a few slides.
00:04:50.340 --> 00:04:53.130
So that gives us the correlations
00:04:53.130 --> 00:04:56.580
but we also need the variances of variables,
00:04:56.580 --> 00:04:58.710
so those are implied by the model as well.
00:04:58.710 --> 00:05:01.560
Now we are working on the correlation metric,
00:05:01.560 --> 00:05:03.695
which means the variance of each variable is 1.
00:05:03.695 --> 00:05:06.960
But that 1 is something that
the model implies as well.
00:05:07.436 --> 00:05:10.856
So when we have the variance of y,
00:05:10.856 --> 00:05:12.000
we have to think,
00:05:12.000 --> 00:05:16.710
how many different ways we
can go from y to somewhere and
00:05:16.710 --> 00:05:17.811
then come back.
00:05:17.811 --> 00:05:19.770
So we can go to the error term,
00:05:19.770 --> 00:05:22.080
we can go up once, then we turn back.
00:05:22.080 --> 00:05:28.560
So that is the variance of the error
term times 1 and times 1 again,
00:05:28.560 --> 00:05:29.730
because we go back and forth.
00:05:30.181 --> 00:05:33.211
Then we have y to x1,
00:05:33.211 --> 00:05:37.170
the variance of x1 is 1 because we
are working with standardized data,
00:05:37.170 --> 00:05:38.310
and we come back.
00:05:38.310 --> 00:05:43.320
So we have beta1 times 1,
times beta1 on the way back,
00:05:43.320 --> 00:05:44.520
so beta1 squared.
00:05:44.520 --> 00:05:48.480
The same for x2 and back, and x3 and back.
00:05:48.705 --> 00:05:52.845
Then we have a way of going from y to x1,
00:05:52.845 --> 00:05:55.250
then one correlation to x2,
00:05:55.250 --> 00:05:57.000
and back to y.
00:05:57.000 --> 00:06:02.040
So that will be beta1 times
the correlation, times beta2,
00:06:02.040 --> 00:06:05.310
and we can take the same path
in the opposite direction,
00:06:05.310 --> 00:06:09.690
through x2 and the correlation and back,
so we get that term twice.
00:06:09.690 --> 00:06:13.115
And that gives us the following equation.
00:06:13.315 --> 00:06:15.385
So we have the direct effects,
00:06:15.560 --> 00:06:19.290
beta1 squared plus beta2
squared plus beta3 squared,
00:06:19.290 --> 00:06:22.920
so we go from x1 and back,
x2 and back, x3 and back,
00:06:22.920 --> 00:06:25.140
so that because we go back and forth,
00:06:25.140 --> 00:06:27.450
we have beta1 twice or beta1 squared,
00:06:27.450 --> 00:06:28.860
because we multiply things together.
00:06:29.636 --> 00:06:34.020
Then we have the correlational paths:
00:06:34.020 --> 00:06:37.236
we go through x1 and x2, and then back,
00:06:37.236 --> 00:06:40.020
and we go through x2 and x1, and then back.
00:06:40.020 --> 00:06:42.150
So that's multiplied by 2,
00:06:42.150 --> 00:06:45.750
we do that for each pair of variables,
00:06:45.750 --> 00:06:47.250
and then we have the variance of the error term.
00:06:47.250 --> 00:06:48.960
So that gives us the variance of y,
00:06:48.960 --> 00:06:53.070
which in a correlation matrix is always 1.
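The variance tracing above can be written compactly. Using the same illustrative beta and R_xx values as before (not from any paper), the squared direct paths and the doubled correlational paths are exactly the terms of the quadratic form beta' R_xx beta, and choosing the error variance as 1 minus that quantity makes the implied variance of y exactly 1:

```python
import numpy as np

# Same illustrative values as before (made-up, standardized metric)
beta = np.array([0.4, 0.3, 0.2])
R_xx = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])

# Direct back-and-forth paths contribute beta_j**2; correlational paths
# contribute 2 * beta_i * r_ij * beta_j. The quadratic form collects both.
explained = beta @ R_xx @ beta
var_u = 1.0 - explained            # error variance on the correlation metric
var_y = explained + var_u
print(explained, var_u, var_y)     # var_y comes out as exactly 1
```
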
00:06:54.472 --> 00:06:58.650
So we can use these rules to calculate the full
00:06:58.650 --> 00:07:01.830
correlation matrix between
all variables in our data.
00:07:01.830 --> 00:07:05.160
So we have here the x variables,
00:07:05.160 --> 00:07:08.790
and the variances of all the x's are 1,
00:07:08.790 --> 00:07:11.686
because we are working with correlations.
00:07:11.686 --> 00:07:16.290
And then the correlation between x's
is something that is given in our data.
00:07:16.290 --> 00:07:20.370
And then we have these equations
for correlation between y and x1,
00:07:20.370 --> 00:07:22.620
y and x2, y and x3,
00:07:22.620 --> 00:07:24.806
and then the variance of y,
00:07:24.806 --> 00:07:29.006
which is the covariance of
the variable with itself,
00:07:29.006 --> 00:07:33.750
so we write that equation; it is the implied
variance expression, not a numeric value.
00:07:34.751 --> 00:07:39.030
So, why would this kind of
model or principle be useful?
00:07:39.631 --> 00:07:41.461
The reason is that,
00:07:41.736 --> 00:07:46.081
if we know this correlation matrix from the data,
00:07:46.081 --> 00:07:51.030
then we can actually calculate
the regression effect estimates.
00:07:51.405 --> 00:07:53.295
So we can also work backwards,
00:07:53.371 --> 00:07:56.011
so we know those correlations in the data,
00:07:56.036 --> 00:07:58.076
and then we can find out
00:07:58.076 --> 00:08:01.470
what set of regression coefficients
beta1, beta2 and beta3,
00:08:01.470 --> 00:08:03.480
and the variance of the error term,
00:08:03.480 --> 00:08:06.330
would be compatible with this correlation matrix.
00:08:06.330 --> 00:08:10.320
So we can find the model
parameters beta1, beta2, beta3,
00:08:10.320 --> 00:08:13.110
and variance of u, the error term,
00:08:13.110 --> 00:08:17.130
that produces this implied correlation matrix.
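Working backwards can be sketched the same way: given the x-correlations and the x-y correlations, the standardized coefficients solve the linear system R_xx beta = r_xy. The numbers here are the illustrative ones from the earlier sketch, not Hekman's:

```python
import numpy as np

# Correlations taken as "data" (illustrative values only)
R_xx = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])
r_xy = np.array([0.61, 0.58, 0.44])

# Invert the implied-correlation equations: find beta with R_xx @ beta = r_xy
beta = np.linalg.solve(R_xx, r_xy)
var_u = 1.0 - beta @ r_xy          # implied error variance, correlation metric
print(beta)                        # recovers the betas behind these correlations
print(var_u)
```

This is why a published correlation matrix is enough to re-estimate a standardized regression model.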
00:08:17.881 --> 00:08:19.410
So let's do that.
00:08:19.410 --> 00:08:23.850
Hekman's paper gives us a correlation
matrix of all the variables.
00:08:23.850 --> 00:08:28.980
So they give the correlations for the variables
00:08:28.980 --> 00:08:30.930
before presenting the regressions.
00:08:30.930 --> 00:08:36.690
So we can calculate this part of the Model 1 here
00:08:36.690 --> 00:08:39.150
using the correlations.
00:08:39.150 --> 00:08:43.101
We get estimates that are
very close to one another.
00:08:43.101 --> 00:08:49.290
So we can see that this is -0.23 and this is -0.23,
00:08:49.290 --> 00:08:51.300
so they're mostly the same.
00:08:51.300 --> 00:08:57.381
There is some imprecision because
these estimates are reported to two digits,
00:08:57.381 --> 00:08:59.490
and the correlations are also two-digit precision,
00:08:59.490 --> 00:09:01.111
so we have some rounding errors.
00:09:01.111 --> 00:09:05.310
And also we have these interaction
terms here in their model,
00:09:05.310 --> 00:09:06.780
that we don't have in this model,
00:09:06.780 --> 00:09:10.440
because they didn't present the
correlations between these interactions
00:09:10.440 --> 00:09:11.340
and other variables.
00:09:11.340 --> 00:09:14.026
But the results are mostly the same.
00:09:14.026 --> 00:09:16.170
There is one important question now.
00:09:16.170 --> 00:09:19.080
If we look at the p-values,
00:09:19.681 --> 00:09:22.170
well, they don't present the p-values themselves,
00:09:22.170 --> 00:09:24.360
but they present significance stars.
00:09:24.360 --> 00:09:28.290
So we tend to have fewer stars
than they have in the paper.
00:09:29.266 --> 00:09:31.710
So it's an important question
when we replicate something,
00:09:31.710 --> 00:09:34.560
if we don't get the
same result, then
00:09:34.560 --> 00:09:35.340
why is that the case?
00:09:37.467 --> 00:09:39.150
Starting to understand
00:09:39.150 --> 00:09:46.710
why the p-values from our replication
differ from Hekman's paper is useful,
00:09:46.710 --> 00:09:50.280
because it teaches you something
about statistical analysis.
00:09:50.280 --> 00:09:56.490
So remember that the p-value is defined
by the standard error, the estimate,
00:09:56.490 --> 00:09:59.670
and the reference distribution against
00:09:59.670 --> 00:10:00.930
which we compare the t statistic,
00:10:00.930 --> 00:10:04.240
which is the ratio of the
estimate to its standard error.
00:10:04.240 --> 00:10:09.660
The estimates here are about
the same as the estimates here.
00:10:10.611 --> 00:10:15.661
So, what could be different
is the standard errors,
00:10:15.661 --> 00:10:18.930
somehow we calculated the standard
errors differently than they do,
00:10:18.930 --> 00:10:24.610
for example, because we don't
include these variables in the model,
00:10:24.736 --> 00:10:29.836
it is possible that our
standard errors are larger.
00:10:29.836 --> 00:10:32.870
That's an unlikely explanation, but it's possible.
00:10:32.870 --> 00:10:35.910
And if our standard errors were
larger than in Hekman's paper,
00:10:35.910 --> 00:10:39.750
that would lead to the p-value differences.
00:10:39.750 --> 00:10:44.700
So let's check if that is a plausible
explanation for the differences.
00:10:44.700 --> 00:10:49.170
To understand, if that's a plausible explanation,
00:10:49.170 --> 00:10:53.400
we have to consider where the standard
errors come from in a regression analysis.
00:10:53.400 --> 00:10:58.440
One way to calculate the standard errors
is an equation that looks like that.
00:10:58.966 --> 00:11:04.080
And remember that we calculate the p-value by
00:11:04.080 --> 00:11:10.236
comparing the estimate divided by
standard error against the t distribution.
00:11:10.840 --> 00:11:13.660
So could our standard errors be different,
00:11:13.660 --> 00:11:19.441
so are the values that we use
different from Hekman's paper?
00:11:19.441 --> 00:11:24.790
The first thing that we notice
is that the R-squared here,
00:11:24.790 --> 00:11:29.031
the R-squared in the formula refers to
00:11:29.031 --> 00:11:35.380
the R-squared from regressing one independent
variable on every other independent variable.
00:11:35.380 --> 00:11:40.550
So we calculate the standard
error for one variable by
00:11:40.550 --> 00:11:47.620
calculating the R-squared of that variable
on every other independent variable.
00:11:47.620 --> 00:11:51.040
So that R-squared j tells
00:11:51.040 --> 00:11:53.980
how much one independent variable overlaps
00:11:53.980 --> 00:11:55.780
with the other independent variables.
00:11:55.780 --> 00:12:03.070
This term here has some additional meanings
that I will explain in a video a bit later.
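The standard-error equation being referred to is on the slide, not in the transcript, but the textbook OLS formula is SE(beta_j) = sqrt(s^2 / ((1 - R2_j) * SST_j)), where R2_j comes from regressing x_j on the other predictors. A sketch with simulated data (all numbers illustrative) that cross-checks this formula against the usual covariance-matrix computation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated correlated predictors and outcome (illustrative values only)
cov = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
X = rng.multivariate_normal(np.zeros(3), cov, size=n)
y = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(scale=0.7, size=n)

# OLS fit with an intercept
Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ b
s2 = resid @ resid / (n - Xd.shape[1])           # error-variance estimate

# Textbook formula: SE(beta_j)^2 = s2 / ((1 - R2_j) * SST_j),
# where R2_j is from regressing x_j on the other predictors
j = 0
others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
fit, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
r = X[:, j] - others @ fit                       # unique part of x_j
sst_j = np.sum((X[:, j] - X[:, j].mean()) ** 2)
R2_j = 1.0 - (r @ r) / sst_j
se_j = np.sqrt(s2 / ((1.0 - R2_j) * sst_j))

# Cross-check against the covariance-matrix computation s2 * (X'X)^-1
se_all = np.sqrt(s2 * np.linalg.inv(Xd.T @ Xd).diagonal())
print(se_j, se_all[1])                           # the two agree
```

The formula makes the lecture's argument concrete: dropping correlated predictors lowers R2_j, which enlarges the denominator and pushes the standard error down, while a larger error variance s2 pushes it up.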
00:12:03.871 --> 00:12:05.800
So if we omit variables,
00:12:05.800 --> 00:12:10.060
so Hekman's study had 15 independent
variables in the first model,
00:12:10.060 --> 00:12:12.340
because they had three interaction terms,
00:12:12.340 --> 00:12:13.851
we only have 12.
00:12:13.851 --> 00:12:17.500
And we know that if we add variables to a model,
00:12:17.500 --> 00:12:19.090
then R-squared can only increase.
00:12:19.090 --> 00:12:23.620
So with fewer variables, R-squared can only decrease,
00:12:23.620 --> 00:12:28.870
so our R-squared should be a bit
smaller than Hekman's R-squared,
00:12:28.870 --> 00:12:31.540
because we have less variables in the model,
00:12:31.540 --> 00:12:32.860
we don't have the interactions.
00:12:33.636 --> 00:12:38.260
If this R-squared J decreases,
00:12:38.260 --> 00:12:44.650
then 1 minus R-squared, this
subtraction result increases,
00:12:44.650 --> 00:12:47.740
and this causes the denominator to increase here.
00:12:48.216 --> 00:12:51.700
So we have a larger denominator here,
00:12:51.700 --> 00:12:57.790
which basically means that the
standard errors will be smaller,
00:13:00.339 --> 00:13:02.350
just based on that consideration.
00:13:02.350 --> 00:13:04.240
So if our standard errors are smaller,
00:13:04.240 --> 00:13:07.540
then we know that our p-values
should be smaller as well,
00:13:07.540 --> 00:13:11.800
because the estimate divided by
standard error will be larger
00:13:11.800 --> 00:13:13.125
when a standard error gets smaller,
00:13:13.125 --> 00:13:15.015
and it will be further from 0,
00:13:15.015 --> 00:13:16.900
which means smaller p-value.
00:13:17.026 --> 00:13:22.336
Then, what about what's here on the top?
00:13:22.336 --> 00:13:25.576
that's the variance of the error term.
00:13:25.576 --> 00:13:29.200
In our model it is 0.78;
00:13:29.200 --> 00:13:35.110
1 minus R-squared is
the variance of the error term,
00:13:35.110 --> 00:13:36.400
in standardized results.
00:13:36.400 --> 00:13:42.130
So Hekman's is 0.75,
00:13:42.130 --> 00:13:43.810
ours is 0.78,
00:13:43.810 --> 00:13:46.235
so there's a 3 percentage point difference.
00:13:46.235 --> 00:13:53.526
So because we can expect this
denominator here to be larger,
00:13:53.526 --> 00:14:00.010
and the numerator to be a bit larger,
00:14:00.010 --> 00:14:07.330
then we are expecting the
standard error
00:14:07.330 --> 00:14:09.250
to be perhaps about the same.
00:14:10.151 --> 00:14:12.610
So we can't really look at the standard errors
00:14:12.610 --> 00:14:15.520
and conclude that there is
a clear reason to believe
00:14:15.520 --> 00:14:18.340
that our standard errors will
be substantially different.
00:14:18.340 --> 00:14:19.900
So we conclude that,
00:14:19.900 --> 00:14:22.900
based on looking at where the
standard errors come from,
00:14:22.900 --> 00:14:26.500
we can't see a clear reason
00:14:26.500 --> 00:14:30.550
why our standard errors would be
larger than in Hekman's paper.
00:14:31.201 --> 00:14:33.361
So that's an unlikely explanation.
00:14:33.361 --> 00:14:36.271
So, why do the p-values then differ?
00:14:36.271 --> 00:14:38.441
If we have the same estimates,
00:14:38.441 --> 00:14:42.836
and we have no reason to believe that
the standard errors differ substantially,
00:14:42.836 --> 00:14:47.890
then what remains as a
plausible explanation is that,
00:14:47.890 --> 00:14:52.660
we are comparing this estimate
divided by the standard error,
00:14:52.660 --> 00:14:55.990
the T statistic against a
different distribution than Hekman,
00:14:55.990 --> 00:14:58.150
and that will produce different p-values.
00:15:00.000 --> 00:15:07.786
So if we divide our p-values by 2,
00:15:08.812 --> 00:15:12.790
we can actually get the
same stars as Hekman mostly.
00:15:12.790 --> 00:15:14.680
So that's an interesting observation.
00:15:14.680 --> 00:15:18.130
Our p-values appear to be twice as large
00:15:18.130 --> 00:15:20.170
as the p-values by Hekman.
00:15:20.796 --> 00:15:22.956
Why would that be the case?
00:15:24.000 --> 00:15:28.780
Well this is an indication
of Hekman actually using
00:15:29.105 --> 00:15:33.065
one-tailed tests instead of two-tailed tests.
00:15:33.065 --> 00:15:36.400
So the difference in one
and two-tailed tests is that
00:15:36.400 --> 00:15:39.880
in a one-tailed test, you only
look at one tail of the distribution,
00:15:39.880 --> 00:15:42.700
so you will get the same significance level
00:15:42.700 --> 00:15:46.060
with a smaller value of the test statistic.
00:15:46.060 --> 00:15:52.540
So here you have a value of 1.7
required for the 5% significance level,
00:15:52.540 --> 00:15:54.485
and here with the two-tailed test,
00:15:54.485 --> 00:15:58.866
because the two tail areas here must sum to 5%,
00:15:58.866 --> 00:16:01.960
we have about 2 for the same problem.
00:16:01.960 --> 00:16:07.210
So with a one-tailed test, you basically
take the p-value of a two-tailed test
00:16:07.210 --> 00:16:09.665
and you divide it by two.
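The halving relationship is easy to see numerically. Because the reference distribution is symmetric, the one-tailed p-value for an effect in the predicted direction is exactly half the two-tailed one. A sketch using the standard normal as a stand-in for a large-df t distribution (the Python standard library has no t CDF; the test statistic 1.8 is just an illustrative number):

```python
from statistics import NormalDist

z = NormalDist()          # stand-in for a large-df t distribution
t_stat = 1.8              # illustrative test statistic

p_two = 2 * (1 - z.cdf(abs(t_stat)))   # two-tailed: both tails count
p_one = 1 - z.cdf(abs(t_stat))         # one-tailed: only the predicted tail
print(p_two, p_one)                    # p_one is exactly p_two / 2
```

So a result that misses the conventional two-tailed 5% threshold, like this one, can still clear it one-tailed, which is why the choice must be reported.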
00:16:10.141 --> 00:16:14.985
Because it is a convention
to use two-tailed tests,
00:16:14.985 --> 00:16:21.470
doing one-tailed tests and
not reporting that you did so
00:16:21.470 --> 00:16:26.320
is basically the same as claiming
that you did two-tailed tests,
00:16:26.320 --> 00:16:28.465
and that's a bit unethical.
00:16:28.465 --> 00:16:32.650
Generally, there are very few good reasons,
00:16:32.650 --> 00:16:34.330
I can't name any good reasons,
00:16:34.330 --> 00:16:35.770
for using one-tailed tests.
00:16:35.770 --> 00:16:42.160
And for example, Abelson's book on
statistical arguments explicitly says that
00:16:42.160 --> 00:16:47.590
using one-tailed tests instead of
two-tailed tests is practically cheating.
00:16:48.291 --> 00:16:50.631
What's interesting is that
00:16:50.631 --> 00:16:53.140
Hekman has published the full
00:16:53.140 --> 00:16:57.851
revision history of his paper,
and while the paper was under review,
00:16:57.851 --> 00:17:01.726
it included a mention that
they used one-tailed tests,
00:17:01.726 --> 00:17:05.080
and you can see many papers
actually do use a one-tailed test
00:17:05.080 --> 00:17:06.940
without really justifying that choice.
00:17:07.140 --> 00:17:09.180
So the choice is unjustified,
00:17:09.180 --> 00:17:10.810
but they nevertheless want to do it,
00:17:10.810 --> 00:17:13.450
presumably, because it makes the p-value smaller,
00:17:13.450 --> 00:17:14.865
and the results look better.
00:17:15.091 --> 00:17:16.830
But they mentioned,
00:17:16.830 --> 00:17:19.210
which is the right thing to do,
00:17:19.210 --> 00:17:20.865
that the p-values are one-tailed.
00:17:20.865 --> 00:17:25.450
For some reason, that part of
the regression table footer
00:17:25.450 --> 00:17:27.100
was eliminated from the published version.
00:17:28.201 --> 00:17:31.171
So the rule of thumb,
00:17:31.171 --> 00:17:32.950
don't use one-tailed tests.
00:17:32.950 --> 00:17:36.481
There is really no good reason
for using one-tailed tests
00:17:36.481 --> 00:17:39.970
and if you do, report it clearly
but you really shouldn't.