In this video, I will introduce you to the important concept that a linear model implies a correlation matrix. This is something that you will typically run into in more advanced texts, but I think it is a very useful principle to understand even in a first course on quantitative analysis.

A linear model is any model where all the relationships are linear, for example the regression model. A correlation matrix quantifies the linear associations between the variables, two variables at a time, on a standardized metric.

So what does it mean that the linear model implies the correlation matrix? Let's take a look at this regression model in path diagram form. We have three independent variables, x1, x2 and x3, linked to the dependent variable y with regression coefficients on the regression paths. Then we have some variation u, the error term, that the model does not explain, and the x's are allowed to be freely correlated; the correlations are shown by the two-headed curved arrows.

This principle says that the correlations between the x variables are what the data gives us. We can simply calculate the correlation between x1 and x2, and it is taken as it is; we say that those correlations are free. But the correlations involving y depend on the model. The correlation between x1 and y depends on the correlations among the x's and on the model parameters, so it is implied by the model.

What this means in practice is that we start from x1 and trace paths. We check in how many different ways we can get from x1 to y, we trace all possible paths, we take the sum of those paths, and that sum is the correlation between x1 and y. Let's take a look at an example. This is an important concept, because if you understand it, it will allow you to understand certain properties of regression analysis at a much deeper level than you otherwise would, and it is also very useful when you think about factor analysis, structural equation models, and other more complicated models. Let's do the tracing.
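As a point of reference, the model in the path diagram can be written as an equation. This is my reconstruction in standard notation, with all variables standardized:

```latex
y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u ,
\qquad \operatorname{var}(x_j) = 1 ,
\qquad r_{jk} = \operatorname{cor}(x_j, x_k)
```

The correlations r_jk between the x's are free, taken from the data, and the error term u is assumed to be uncorrelated with the x's.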
The idea of the path analysis tracing rules is that we pick two variables. If we want to calculate the correlation between two variables, say x1 and y, we check in how many different ways we can get from x1 to y, where we can only travel along the one-headed arrows downwards, or travel up, then along at most one curved arrow, and then back down again.

From x1 we can get to y in three different ways: we can go along the direct regression path; we can go from x1 along the correlation to x2 and then down to y (we cannot continue on to x3, because we can only take one correlation per path); and we can go from x1 to x3 and down to y. Those are all three paths that we can take from x1 to y.

This gives us the following equation: the correlation between x1 and y is the direct path, plus the correlation between x1 and x2 times the direct path from x2, plus the correlation between x1 and x3 times the direct path from x3. (I will write this equation out, together with the implied variance of y, right after the variance tracing below.)

What is the interpretation of this equation? It is that the correlation between x1 and y equals the direct effect plus any spurious effects, because x1 is correlated with x2 and x3, which both have effects on y. So this correlation is produced by the relationship of interest plus these spurious other causes, or common causes, of y that correlate with x1.

So that is the idea: we identify the three paths, we multiply everything along each path, and then we take the sum over the paths. Here, the path from x1 through x2 includes the correlation between x1 and x2 and the regression path from x2 to y, so we multiply those to get the value of the path, and we sum all the paths to get the correlation. The importance of this rule will be made clear in a few slides.

That gives us the correlations, but we also need the variances of the variables, and those are implied by the model as well. We are working on the correlation metric, which means the variance of each variable is 1, but that 1 is something that the model implies as well. So for the variance of y, we have to think in how many different ways we can go from y to somewhere and then come back.
We can go to the error term: we go up once, and then we turn back, so that is the variance of the error term times 1 and times 1 again, because we go back and forth. Then we have y to x1: the variance of x1 is 1, because we are working with standardized data, and we come back, so we have beta1 times 1 times beta1 on the way back, that is, beta1 squared. The same for x2 and back, and for x3 and back.

Then we have a way of going from y to x1, then along one correlation to x2, and back to y. That is beta1 times the correlation times beta2, and we can take the same path in the opposite direction, to x2, along the correlation, and back, so we get that term twice.

That gives us the following math. We have the direct effects, beta1 squared plus beta2 squared plus beta3 squared, because we go from x1 and back, x2 and back, and x3 and back; since we go back and forth, we have beta1 twice, that is, beta1 squared, because we multiply everything along the path. Then we have the correlational paths: we go to x1, across the correlation to x2, and back, and we go to x2, across the correlation to x1, and back, so that term is multiplied by 2, and we do the same for each pair of variables. And then we have the variance of the error term. That gives us the variance of y, which in a correlation matrix is always 1.

We can use these rules to calculate the full correlation matrix between all the variables in our data. The variances of the x's are 1, because we are working with correlations, and the correlations between the x's are given by our data. Then we have these equations for the correlations between y and x1, y and x2, and y and x3, and for the variance of y, which is the covariance of the variable with itself, so that cell is the implied variance, not an actual data value.

So why would this kind of principle be useful? The reason is that if we know the correlation matrix from the data, then we can actually calculate the regression estimates.
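Collected together, and in my notation rather than the slide's (r12 is the correlation between x1 and x2, and so on), the tracing rules just described give the implied correlation and the implied variance:

```latex
\operatorname{cor}(x_1, y) = \beta_1 + r_{12}\,\beta_2 + r_{13}\,\beta_3

\operatorname{var}(y) = \beta_1^2 + \beta_2^2 + \beta_3^2
  + 2\left(\beta_1\beta_2 r_{12} + \beta_1\beta_3 r_{13} + \beta_2\beta_3 r_{23}\right)
  + \operatorname{var}(u) = 1
```

Here is also a minimal sketch in Python of both directions: computing the implied correlations and variance from assumed coefficients, and recovering the coefficients from a correlation matrix. The numbers are made up for illustration; they are not from Hekman's paper.

```python
import numpy as np

# Hypothetical standardized coefficients and x-correlations (not from the paper)
b = np.array([0.4, 0.3, 0.2])            # beta1, beta2, beta3
R_xx = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])        # correlations among x1, x2, x3

# Forward: model-implied correlations with y (the tracing rule in matrix form)
r_xy = R_xx @ b                           # e.g. cor(x1, y) = b1 + r12*b2 + r13*b3
var_u = 1 - b @ R_xx @ b                  # error variance that makes var(y) = 1
var_y = b @ R_xx @ b + var_u              # direct paths + correlational paths + error
print(r_xy.round(2), round(var_y, 2))     # implied correlations, and var(y) = 1.0

# Backward: given a correlation matrix, solve for the compatible coefficients
b_hat = np.linalg.solve(R_xx, r_xy)       # recovers [0.4, 0.3, 0.2]
var_u_hat = 1 - r_xy @ b_hat              # 1 - R-squared on the correlation metric
print(b_hat.round(2), round(var_u_hat, 2))
```

The matrix product `R_xx @ b` is simply the tracing rule applied to every x at once, and `np.linalg.solve` reverses it, which is the "working backwards" step discussed next.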
So we can also work backwards: we know the correlations in the data, and we can find out what set of regression coefficients beta1, beta2 and beta3, and what variance of the error term, would be compatible with that correlation matrix. In other words, we can find the model parameters beta1, beta2, beta3 and the variance of u, the error term, that produce the implied correlation matrix.

So let's do that. Hekman's paper gives us a correlation matrix of all the variables; they give the correlations for the variables, but not for the interaction terms. So we can calculate this part of Model 1 using the correlations. We get estimates that are very close to one another: for example, this estimate is -0.23 and this one is -0.23, so they are mostly the same. There is some imprecision, because the estimates are reported at two-digit precision and the correlations are reported at two-digit precision, so we have some rounding errors. Also, their model has interaction terms that ours does not, because they did not present the correlations between the interactions and the other variables. But the results are mostly the same.

There is one important question, though. If we look at the p-values, or rather at the significance stars, because they do not present the p-values themselves, we tend to have fewer stars than they have in the paper. That is an important question whenever we replicate something and do not get the same result: why is that the case?

Starting to understand why the p-values from our replication differ from those in Hekman's paper is useful, because it teaches you something about statistical analysis. Remember that the p-value is determined by the estimate, the standard error, and the reference distribution against which we compare the t statistic, which is the ratio of the estimate to its standard error. The estimates in our replication are about the same as the estimates in the paper.

So what could be different is the standard errors: perhaps we calculated the standard errors differently than they did. For example, because we do not include the interaction variables in the model, it is possible that our standard errors are larger.
That is an unlikely explanation, but it is possible. If our standard errors were larger than in Hekman's paper, that would lead to the p-value differences. So let's check whether that is a plausible explanation for the differences.

To do that, we have to consider where the standard errors come from in regression analysis. One way to calculate the standard errors is with an equation that looks like this, and remember that we calculate the p-value by comparing the estimate divided by its standard error against the t distribution.

So could our standard errors be different; are the values that go into the equation different from Hekman's paper? The first thing to notice is the R-squared in the formula: it refers to the R-squared of one independent variable regressed on every other independent variable in the model. So we calculate the standard error for one variable by calculating the R-squared of that variable on every other independent variable. That R-squared, R-squared_j, tells us what is unique in one independent variable compared to the other independent variables. This term also has some additional meanings that I will explain in a later video.

So what happens if we omit variables? Hekman's study had 15 independent variables in the first model, because they had three interaction terms; we only have 12. We know that if we add variables to a model, the R-squared can only increase, and if we drop variables, it can only decrease. So our R-squared_j should be a bit smaller than Hekman's, because we have fewer variables in the model; we do not have the interactions.

If this R-squared_j decreases, then 1 minus R-squared_j increases, and that makes the denominator of the equation larger. A larger denominator basically means that the standard errors will be smaller, just based on that consideration. And if our standard errors are smaller, then our p-values should be smaller as well, because the estimate divided by the standard error will be larger when the standard error gets smaller, so the t statistic will be further from 0, which means a smaller p-value.

Then what about the numerator of the equation? That is the variance of the error term.
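The equation itself is on the slide and not in the captions; a common textbook form of the standard error it describes (my reconstruction, not necessarily the exact notation used on the slide) is:

```latex
\operatorname{SE}(\hat\beta_j)
  = \sqrt{ \frac{\hat\sigma_u^{2}}{(1 - R_j^{2}) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^{2}} }
```

where the numerator, the estimated variance of the error term, is discussed next, and R_j^2 in the denominator is the R-squared from regressing x_j on the other independent variables, the term discussed above.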
Hekman's error variance, which is 1 minus the R-squared of the model and equals the variance of the error term in standardized results, is 0.75; ours is 0.78, so there is a three percentage point difference.

So we can expect the denominator term of the equation to be a bit larger and the numerator to be a bit larger as well, which means we should expect our standard errors to be perhaps about the same as theirs. We cannot really look at where the standard errors come from and conclude that there is a clear reason to believe our standard errors would be substantially different. So we conclude that, based on where the standard errors come from, we cannot see a clear reason why our standard errors would be larger than in Hekman's paper. That is an unlikely explanation.

So why do the p-values then differ? If we have the same estimates, and we have no reason to believe that the standard errors differ substantially, then the plausible explanation that remains is that we are comparing the t statistic, the estimate divided by the standard error, against a different distribution than Hekman did, and that produces different p-values.

Indeed, if we divide our p-values by 2, we mostly get the same stars as Hekman. That is an interesting observation: our p-values appear to be twice as large as the p-values in Hekman's paper. Why would that be the case?

Well, this is an indication that Hekman actually used one-tailed tests instead of two-tailed tests. The difference is that in a one-tailed test you only look at one tail of the distribution, so you reach the same significance level with a smaller value of the test statistic. Here, a value of about 1.7 is required for the 5% significance level, whereas with a two-tailed test, because the two tail areas together must sum to 5%, you need a value of about 2 for the same problem. So with a one-tailed test you essentially take the p-value of a two-tailed test and divide it by two.
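To make those numbers concrete, here is a small illustrative sketch; the degrees of freedom are picked arbitrarily and are not taken from Hekman's model.

```python
from scipy import stats

df = 100  # illustrative degrees of freedom, not taken from Hekman's model

# Critical t values for the 5% significance level
t_one_tailed = stats.t.ppf(0.95, df)   # ~1.66: the whole 5% sits in one tail
t_two_tailed = stats.t.ppf(0.975, df)  # ~1.98: 2.5% in each tail
print(t_one_tailed, t_two_tailed)

# For a t statistic (estimate / standard error) in the predicted direction,
# the one-tailed p-value is exactly half of the two-tailed p-value.
t_stat = 1.8
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
p_one_tailed = stats.t.sf(t_stat, df)
print(p_two_tailed, p_one_tailed)      # p_one_tailed equals p_two_tailed / 2
```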
Because it is the convention to use two-tailed tests, doing one-tailed tests and not reporting that you did so is basically the same as claiming that you did two-tailed tests, and that is a bit unethical. Generally, there are very few good reasons, and I cannot name any good ones, for using one-tailed tests. Abelson's book on statistical arguments, for example, explicitly says that using one-tailed tests instead of two-tailed tests is practically cheating.

What is interesting is that when Hekman's paper was under review, and he has published the full revision history of the paper, the authors included a mention that they used one-tailed tests. You can see that many papers actually do use one-tailed tests without really justifying the choice. So the choice is unjustified, but authors nevertheless want to do it, presumably because it makes the p-values smaller and the results look better. They did mention, which is the right thing to do, that the p-values are one-tailed, but for some reason that part of the regression table footer was eliminated from the published version.

So the rule of thumb is: do not use one-tailed tests. There is really no good reason for using them, and if you do, report it clearly, but you really should not.