The added variable plot, or partial regression plot, is one of the most useful diagnostic plots after a regression analysis. This plot also demonstrates some features of regression analysis. So let's take a look at what a partial regression plot actually does. We need some data and a regression model to do the plot. We have the data here, the Prestige data, and we run a regression of prestige on income, education, and share of women.

Then we do the partial regression plots, or added variable plots, and they are shown here. We usually do these plots for every independent variable, so these are basically three independent plots, and we'll just be looking at the first one now, because the other ones are done in exactly the same way.

So this is the first added variable plot, or partial regression plot. Why it's called a partial regression plot will become clear in a few moments. The idea here is that we have a line that goes through the data: this is a scatter plot of data, and then there's a regression line. So what are these data about? These are not our original observations; they are education conditional on the others and prestige conditional on the others. To understand what these points and this line signify, it's useful to understand how the plot is actually calculated, and it is very simple to calculate.

So this is R code for my own added variable plot. The idea of the added variable plot is that you first regress one of the independent variables, education here, on the other independent variables, income and women. Then we regress the dependent variable, prestige, on those same other independent variables, leaving education out. Then we take the residuals: the residual of this first regression analysis here, and the residual of this other regression analysis here. For those of you who don't know R, education here is the dependent variable of this auxiliary regression, and income and women are the independent variables. It's pretty simple to understand; it's just a slightly different way of writing education = beta0 + beta1 * income + beta2 * women.
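For readers following along in R, here is a minimal sketch of those two auxiliary regressions, assuming the Prestige data from the carData package; the object names m_x, m_y, e_x and e_y are illustrative and not taken from the video.

    # Sketch of the two auxiliary regressions described above,
    # using the Prestige data from the carData package.
    library(carData)
    data(Prestige)

    # Regress the focal independent variable on the other independent variables
    m_x <- lm(education ~ income + women, data = Prestige)

    # Regress the dependent variable on the same other independent variables
    m_y <- lm(prestige ~ income + women, data = Prestige)

    # Keep the parts that income and women do not explain
    e_x <- resid(m_x)   # education conditional on the others
    e_y <- resid(m_y)   # prestige conditional on the others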
Then we run a regression analysis where we simply have the prestige residuals as the dependent variable and the education residuals as the independent variable, we do a scatter plot, and we draw the regression line. The result is the partial regression plot.

So this plot here is the one we built ourselves, using the built-in plot command, and this one is what the R command produces, the avPlots command in R. So we can do the exact same plot by hand. The diagnostic plot for regression analysis just adds a grid here, and it adds nicer labels for the plot and the plot axes; otherwise it is exactly the same.

So the plot explains, or tells us, what the relationship between education and prestige is when we eliminate all the other variables from the model. It tells us what the bivariate relationship is after we control for the other variables.

We can also consider this from the Venn diagram perspective. The Venn diagram perspective on regression analysis is that we have the dependent variable here, prestige, and then we have the independent variables: this is education, and these are the other variables. Prestige and education are correlated, and this area here is the correlation. We want to know what part of this overall correlation is unique to education and prestige, and not accounted for by these other variables. We can see that there is some overlap among all the variables, and there are some unique relationships between all of these variables; this here signifies the two other variables.

So what we do here is that we regress prestige on these other variables, we regress education on these other variables, and then we take the residuals. The residual is this part here, if this is a multivariate regression model: the residual is the part that the other variables don't explain. So if we regress education on these other variables and prestige on these other variables, and we take the residuals, what remains is this.
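A sketch of that final step, together with the equivalent call to avPlots from the car package, is below; it continues the code above (reusing e_x, e_y and the Prestige data), and the axis labels are approximations of the ones in the video.

    # Plot the prestige residuals against the education residuals and add the
    # regression line; this reproduces the added variable plot by hand.
    plot(e_x, e_y, xlab = "education | others", ylab = "prestige | others")
    abline(lm(e_y ~ e_x))

    # The same plot produced by the car package
    library(car)
    fit <- lm(prestige ~ education + income + women, data = Prestige)
    avPlots(fit, terms = ~ education)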
So we have the residual of prestige and the residual of education, and the added variable plot now tells us graphically about this bivariate relationship. Importantly, the correlation between these two variables now corresponds to the regression coefficient, if we are using standardized estimates. So the correlation gives us this regression coefficient.

And that's how it works. So here, education conditional on others is this area, after we have eliminated all the variation that the other variables explain, and prestige here on the y-axis is the prestige residual, after we have eliminated the influence of all the other variables from the data.

Another interesting feature is this: if we regress prestige on education, income and women, and we also regress the residual from the added variable plot regression on the other residual, so the residual of prestige on the residual of education, then this regression coefficient here is exactly the same as the regression coefficient here. So you can calculate regression coefficients this way as well. You can take the variation away one variable at a time, and the regression coefficient you get for the final variable is the same as the regression coefficient you get when you enter all the variables at the same time.

So we can check that the coefficient here is the same as here. The standard errors differ, because here we assume that the effects of income and women are known, while here they are estimated, so those are slightly different for that reason. And that regression coefficient is actually the slope of this line.
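Here is a self-contained sketch of that check, again assuming the Prestige data from the carData package; the object names are illustrative.

    # Compare the education coefficient from the full model with the slope of
    # the residual-on-residual regression (the line in the added variable plot).
    library(carData)
    data(Prestige)

    e_x <- resid(lm(education ~ income + women, data = Prestige))
    e_y <- resid(lm(prestige ~ income + women, data = Prestige))

    fit_full <- lm(prestige ~ education + income + women, data = Prestige)
    fit_res  <- lm(e_y ~ e_x)

    coef(fit_full)["education"]   # coefficient when all variables enter at once
    coef(fit_res)["e_x"]          # slope of the added variable plot: same value

    # The standard errors are not identical, because the residual regression
    # treats the effects of income and women as known rather than estimated.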
So why is this useful? It is useful because it allows you to graphically present how one variable influences the dependent variable. When you have a line, the slope tells you everything that you need to know about the line. But when you have more complicated relationships, like a log-transformed dependent variable, a log-transformed independent variable, or a U-shape where you include the square of a variable, then you can use the same kind of plotting for those cases: you don't have a line but a curve, and you can check how that curve explains the data, controlling for all the other variables (a rough sketch of this idea follows below). So this is useful not only for diagnostics but also for interpretation, and I have myself used this kind of plot for interpretation purposes in one paper that I've written.

Also, the idea that this regression coefficient is the same as regressing one residual on another allows you to understand what this paper by Aguinis and Vandenberg is saying. They are saying that if we have lots of controls in the model, then we are basically just analyzing residuals from a model where the dependent variable is first regressed on those controls and the independent variable is regressed on those controls as well. So we are analyzing the relationship between two residuals. Whether that is problematic or not is something I will not go into in this video, but it is technically correct to say that this is just an analysis of residuals.
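Returning to the curve idea above, here is one rough way such a check could be done by hand; the quadratic term and the use of the Prestige data are illustrative assumptions, not something shown in the video.

    # Fit a curve (here a quadratic) to the residuals and draw it over the
    # partial regression scatter plot, instead of a straight line.
    library(carData)
    data(Prestige)

    e_x <- resid(lm(education ~ income + women, data = Prestige))
    e_y <- resid(lm(prestige ~ income + women, data = Prestige))

    quad <- lm(e_y ~ e_x + I(e_x^2))   # U-shape via the squared term
    plot(e_x, e_y, xlab = "education | others", ylab = "prestige | others")
    grid_x <- seq(min(e_x), max(e_x), length.out = 100)
    lines(grid_x, predict(quad, newdata = data.frame(e_x = grid_x)))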