In this video, I will show you one possible workflow for regression analysis. This workflow addresses all the assumptions that are empirically testable after a regression analysis. There are, of course, multiple ways of testing assumptions, but this is the way I like to do it. I'm using R for this example, but all of these tests and diagnostics can be done with Stata as well, and most of them can be done with SPSS.

A regression analysis workflow, like any other statistical analysis workflow, starts by stating a hypothesis that we want to test. Then we collect data for testing the hypothesis. After that, we explore the data, because it is important to understand the relationships before modeling. Then we estimate the first regression model, with the independent variables and the dependent variable. We check the results briefly to see what they look like, and we proceed with diagnostics.

The diagnostics consist of various plots, and I prefer plots over statistical tests. The reason is that while you can, for example, run a test for heteroskedasticity, that test will only tell you whether there is a problem or not; it will not tell you the nature of the problem. It is much more informative to look at the actual distribution of the residuals to see what the heteroskedasticity problem is like. Also, if you just eyeball these plots, you will identify basically the same things that the tests would tell you. So I don't generally use tests unless someone asks me to.

When I have done the diagnostics, I figure out what the biggest problem is. For example, I may identify nonlinear relationships that I didn't think of in advance, or some outliers, or some heteroskedasticity. Once I have fixed the biggest problem, I go back and fit another regression model, and then I do the diagnostics again. Once I'm happy, I conclude that that is my final model. After the diagnostics, I possibly run nested model tests against alternative models. And then comes the fun part: I interpret what the regression coefficients mean. I don't just state that some coefficient is 0.02; I explain what it means in my particular research context. That is the hard part in regression analysis.
To demonstrate regression diagnostics, we are going to use the Prestige dataset again. Our dependent variable is prestige this time, and we are going to use education, income, and share of women as independent variables. That is our regression model. We have gone through the regression estimates in a previous video, so I will not explain them in detail. Instead, I am going to focus now on checking the assumptions. How do we know that the six regression assumptions actually hold?

The first assumption is that all relationships are linear; it is a linear model. The second is that observations are independent. Independence of observations comes from our research design, and in a cross-sectional study it is difficult to test. If you have a longitudinal study, then you can do some checks for independence of observations. The third assumption is no perfect collinearity and non-zero variances of the independent variables. Perfect collinearity happens if two or more variables perfectly determine one another. For example, if you have a categorical variable with three categories, then including three dummies leads to this problem, because once you know two of the dummies, you know the value of the third. Zero variance occurs, for example, if you are studying the effects of gender and you have no women in the sample: then you have no variance in gender. We know that this is not a problem in our data, because if it were, we could not even estimate the regression model. The fact that we got regression estimates indicates that we do not have a problem with the third assumption.

The other assumptions are a bit more problematic, because they concern the error term, and we cannot observe the error term. The fourth assumption is that the error term has an expected value of zero given any values of the independent variables. The fifth is that the error term has equal variance; this is the homoskedasticity assumption. The sixth is that the error term is normally distributed. The way we test these three assumptions about the error term is that we use the residuals as estimates of the error term.
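As a sketch of this step in R, assuming the Prestige data that ships with the car package (variables prestige, education, income, and women), the model can be fitted like this:

library(car)  # loads carData, which provides the Prestige dataset

# Regress prestige on education, income, and share of women
m <- lm(prestige ~ education + income + women, data = Prestige)
summary(m)  # brief look at the estimates before the diagnostics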
If an observation is far from the regression line in the population, that is, it has a large value of the error term, then we can expect that it also has a large residual. So we can use the residuals as estimates of the error terms, and most regression diagnostics consist of analyzing the residuals. That is quite natural: the residual is the part of the data that the model does not explain, and the idea of diagnostics is to check whether the model explains the data adequately, so it makes sense to look at the part of the data the model does not explain for clues about what could go wrong.

I normally start with the normal Q-Q plot of the residuals. The normal Q-Q plot shows whether the residuals are normally distributed. It compares the residuals, or quantities calculated based on standardized residuals, against the normal distribution. There are different kinds of residuals, and for an applied researcher it does not really matter to know them all; what is important is that your software will calculate the right kind of residual for you automatically when you do these plots. If the points follow a straight line, the residuals are normally distributed. One problematic case is a chi-square distributed error term, where the residuals in one tail are further from the mean than they are supposed to be. Another is a uniformly distributed error term, which creates an S shape in the normal Q-Q plot.

While normality of the error term is not an important assumption in regression analysis, I nevertheless do this plot because it is quick, it identifies outliers for me, and it gives me a first look at the data. With our actual data, I can see that the residuals follow the normal distribution, so I am happy with this: it is an indication of a well-fitting model with respect to the sixth assumption.
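A minimal way to produce this plot in R, assuming the model object m fitted above (which = 2 selects the normal Q-Q plot among plot.lm's standard diagnostics):

# Normal Q-Q plot of the standardized residuals
plot(m, which = 2)

# Essentially the same plot built by hand
qqnorm(rstandard(m))
qqline(rstandard(m))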
R labels possible outliers in this plot. Newsboys has a large negative residual: newsboys are less prestigious than the model predicts. Farmers are more prestigious than the model predicts: farmers do not make much money, and you do not need a high education to be a farmer, but farmers are still appreciated a lot. That is the other extreme case. So the normal Q-Q plot shows that the residuals are roughly normally distributed, and that is a good thing. We conclude there are no problems, and we start looking at more complicated plots.

The next plot is the residual versus fitted plot. The idea of the residual versus fitted plot is that it allows us to check for nonlinearities and heteroskedasticity in the data. The fitted value is calculated from the regression equation: we multiply the variables by the regression coefficients, and then we plot the residuals against the fitted values. Ideally, there is no pattern; the residuals are just spread out evenly over the fitted values. That is an indication of a well-fitting model in this regard.

One problematic case is a heteroskedasticity problem where the variation of the residuals, and hence of the error term, is a lot smaller in the middle and then opens up to the left and to the right. This is a butterfly shape of residuals, and it is the worst kind of heteroskedasticity problem you could have. It is not very realistic, though, because it is difficult to think of a process that would generate this kind of data. Another case combines nonlinearity with some heteroskedasticity: the residuals form a megaphone opening to the right, with a slight curve. There can also be severe nonlinearity, where the right shape is not a line but a curve, and there are datasets that have both a nonlinearity problem and a heteroskedasticity problem at once. Typically, in these diagnostic plots that plot the residual against something else, you are looking for no pattern at all.

In our residual versus fitted plot, the observations with residuals that are high in absolute value are marked again. Looking at the fitted values, there are only a few professions for which the model predicts high prestige, and most fitted values are between 30 and 70.
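The residual versus fitted plot for the same model object m can be drawn with plot.lm as well (which = 1):

# Residuals against fitted values; look for patterns and uneven spread
plot(m, which = 1)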
What can we infer from this plot? We can infer that maybe the variance of the residuals decreases slightly to the right. We do not have many observations there, so we do not know whether the dispersion is actually the same on the right and we just happen to observe two values from it. But if you compare the dispersion on the left with the dispersion on the right, it does look slightly smaller on the right, so it is possible that we have a heteroskedasticity problem, in which case the fifth assumption does not hold. Whether that is severe enough to warrant using heteroskedasticity-robust standard errors is a bit unclear, because this is not a clear-cut case where we should use them.

Then we check for outliers. So far we have been looking for evidence of heteroskedasticity and nonlinearity; we have found some evidence of heteroskedasticity, but not really of nonlinearity. Looking for outliers is the final step, and it uses the fourth plot, the residual versus leverage plot, which tells us which observations are influential. We are looking for observations that have both high leverage and a residual that is large in absolute value. Here, general managers have high leverage and a large residual in absolute value. Stata, for example, uses the squared residual in this plot; because a square is always positive, it is easier to see which observations have large residuals. In R's plot we have to look for both large negative and large positive values, so it is not as simple as if the plot showed the squared residual. Minister has high leverage, newsboys has a large residual, and general managers has some of both. Cook's distance is another measure of influence, and observations with a large Cook's distance are potential outliers.
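Both influence diagnostics are part of plot.lm's standard output; a sketch, again assuming the model object m from above:

# Standardized residuals against leverage, with Cook's distance contours
plot(m, which = 5)

# Cook's distance for each observation
plot(m, which = 4)

# Or list the most influential observations directly
head(sort(cooks.distance(m), decreasing = TRUE))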
As in the Deephouse paper discussed before, to deal with these outliers we look at why the prestige of one occupation would be different from the others. For example, general managers earn a lot of money; their salaries are high. Therefore their predicted prestige should be high as well, because prestige depends on income. But they are less prestigious than the model predicts, which means that the model over-predicts their prestige because of their high income. That could be one reason to drop general managers, but you have to use your own judgment: this is 102 observations, so dropping one observation reduces our sample size by about 1%, and that could be consequential. To recap, leverage is, conceptually, the distance from the mass center of the data, and Cook's distance is another measure of influence.

Once we have identified outliers using this plot, we start looking at the final plot, which is the added-variable plot. The added-variable plot shows the relationship between the dependent variable and one independent variable at a time. It works as follows. Take education, the focal independent variable, regress it on the other independent variables, and take the residual. That residual is the part of education that is not explained by income or share of women; if you think of the Venn diagram presentation of regression analysis, it is the part of education that does not overlap with any of the other independent variables. Then regress prestige on the other independent variables and take the residual there as well. So we take what is unique to prestige and what is unique to education after partialling out the influence of all the other variables in the model, and then we draw a line through those points. That line is actually the regression line of prestige on education from the multiple regression: one way to calculate a regression coefficient is to regress both the dependent and the focal independent variable on all the other independent variables, and then run a regression using just the two sets of residuals. It produces the exact same coefficient as including education with all the other variables directly in a multiple regression analysis.

This plot allows us to look for nonlinearities and heteroskedasticity in a more refined manner. What we can identify here is that the effect of income looks pretty weird. We want the observations to form a band around the regression line, but for income the pattern looks more like a curve: it goes up and then flattens out a bit. We also have much more dispersion at one end of the range than at the other.
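The car package draws these plots with avPlots; the second half of this sketch builds the education panel by hand, to show the partialling-out logic described above:

# Added-variable plots for all independent variables
avPlots(m)

# The education panel by hand: residualize prestige and education
# on the other independent variables, then plot one residual on the other
e_y <- resid(lm(prestige ~ income + women, data = Prestige))
e_x <- resid(lm(education ~ income + women, data = Prestige))
plot(e_x, e_y)
abline(lm(e_y ~ e_x))  # slope equals education's multiple regression coefficient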
Now we have done the diagnostics: we did the normal Q-Q plot, the residual versus fitted plot, the influence plot (the outlier plot), and the added-variable plot. Now we have to decide what we want to do with the model. One idea we could try is heteroskedasticity-robust standard errors, but our sample size is small and there is no clear evidence of a serious heteroskedasticity problem, so in this case I would probably use the conventional standard errors. We could consider dropping general managers and seeing whether the results change; even if we decide to keep general managers in our sample, that could work as a robustness check in the paper. In Deephouse's paper, they estimated the same model with and without the one outlier observation and then compared the results.

And we should consider a log transformation of income. Considering income in relative terms makes a lot more sense anyway. When you think of pay raises, for example, or when you want to switch to a new job, you typically negotiate a salary increase relative to your current level. Also, how much additional salary increases your quality of life depends on your current salary level. If you give 1000 euros to somebody who makes 1000 euros per month, that is a big difference; if you give 1000 euros to somebody who makes 5000 euros a month, it is a smaller difference. Income, company revenues, and quantities of that kind we typically want to consider in relative terms, and to do that we use the log transformation.
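As a sketch of these follow-ups in R (the row name "general.managers" and the sandwich/lmtest route for robust standard errors are my assumptions, not something fixed by the video):

# Refit with income in relative terms
m2 <- lm(prestige ~ education + log(income) + women, data = Prestige)
summary(m2)

# Robustness check: drop general managers and compare the estimates
m3 <- update(m2, data = Prestige[rownames(Prestige) != "general.managers", ])
summary(m3)

# If we did want heteroskedasticity-robust standard errors:
library(lmtest)
library(sandwich)
coeftest(m2, vcov = vcovHC(m2))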