WEBVTT
Kind: captions
Language: en

00:00:01.306 --> 00:00:05.000
Regression analysis will give you estimates of regression coefficients

00:00:05.000 --> 00:00:08.381
and statistical tests of whether those coefficients are

00:00:08.381 --> 00:00:10.200
different from zero in the population.

00:00:10.820 --> 00:00:15.780
Sometimes, however, it is very useful to be able to test other hypotheses.

00:00:15.780 --> 00:00:20.809
For example, whether a coefficient differs from a value other than 0,

00:00:20.809 --> 00:00:24.188
or whether two coefficients are the same in the population.

00:00:24.507 --> 00:00:26.520
To do that, we need to understand

00:00:26.520 --> 00:00:30.240
how to test a linear hypothesis after regression analysis.

00:00:31.090 --> 00:00:33.520
So let's take an example of

00:00:33.520 --> 00:00:36.422
a regression of prestige

00:00:36.422 --> 00:00:39.420
on education, women, and type of occupation,

00:00:39.420 --> 00:00:42.342
using the prestige data that we have been using before.

00:00:42.661 --> 00:00:44.700
So we get some regression estimates, and

00:00:44.700 --> 00:00:47.276
we'll be focusing on these dummy variables.

00:00:47.276 --> 00:00:51.300
The effects of professional and white collar here tell us

00:00:51.300 --> 00:00:53.007
what the expected difference is

00:00:53.007 --> 00:00:57.870
between professional occupations and blue-collar occupations,

00:00:57.870 --> 00:01:00.930
and between white-collar occupations and blue-collar occupations.

00:01:01.213 --> 00:01:04.603
So the regression coefficients here are differences

00:01:04.603 --> 00:01:06.906
relative to a reference category,

00:01:06.906 --> 00:01:08.589
which is blue collar.

00:01:09.174 --> 00:01:13.665
However, sometimes knowing the difference

00:01:13.665 --> 00:01:18.420
between each category and the reference category is not enough.
00:01:18.774 --> 00:01:20.664
What if we wanted to know

00:01:20.664 --> 00:01:23.670
what the difference is between professional and white collar,

00:01:23.670 --> 00:01:26.370
and whether it is statistically significant?

00:01:26.583 --> 00:01:27.843
The difference between

00:01:27.860 --> 00:01:29.660
professional and white-collar occupations

00:01:29.660 --> 00:01:31.748
is simply the sum of these two estimates,

00:01:31.748 --> 00:01:33.945
so it's about 10,

00:01:33.945 --> 00:01:36.930
but is that difference statistically significant?

00:01:36.930 --> 00:01:39.428
So we need to get a p-value.

00:01:39.747 --> 00:01:44.070
We can see that the p-value for professionals

00:01:44.070 --> 00:01:49.895
is about 0.08 for an estimate of 7,

00:01:49.895 --> 00:01:51.695
and based on that,

00:01:51.695 --> 00:01:55.770
considering that the difference between professionals and white collars is 10,

00:01:55.770 --> 00:02:00.600
we could guess that the difference of 10 might be significant,

00:02:00.600 --> 00:02:02.910
when a difference of 7 is already close to significant.

00:02:02.910 --> 00:02:06.390
However, we need to do a proper test to assess

00:02:06.390 --> 00:02:07.387
whether that's the case.

00:02:08.167 --> 00:02:10.920
To do that, we use the Wald test.

00:02:11.841 --> 00:02:14.070
And here, for the Wald test,

00:02:14.070 --> 00:02:17.124
the null hypothesis that I have in mind is that

00:02:17.124 --> 00:02:21.120
the type professional coefficient is the same as the type white collar coefficient.

00:02:21.634 --> 00:02:24.304
To calculate the Wald test statistic,

00:02:24.304 --> 00:02:29.340
we take the estimate squared and divide it by the standard error squared.

00:02:29.872 --> 00:02:31.058
So how do we do that?

00:02:31.058 --> 00:02:32.884
We have to define what the estimate is here,

00:02:33.043 --> 00:02:34.813
and what the standard error is here.
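As a quick sketch of the single-coefficient case mentioned above: with an estimate of about 7 and a two-sided p-value of about 0.08, the implied standard error is roughly 4 (a hypothetical value chosen here only to be consistent with the quoted p-value, not taken from the actual prestige fit). A minimal Python check of the Wald statistic, estimate squared over standard error squared:

```python
import math

# Hypothetical numbers matching the lecture: estimate ~7, two-sided p ~0.08.
# The standard error of 4 is an assumption chosen to match that p-value.
estimate = 7.0
se = 4.0

# Wald statistic: estimate squared divided by standard error squared
wald = (estimate / se) ** 2
print(wald)  # 3.0625

# Chi-square(1) survival function has the closed form erfc(sqrt(x / 2)),
# the same as the two-sided normal tail of z = estimate / se.
p_value = math.erfc(math.sqrt(wald / 2))
print(round(p_value, 2))  # 0.08
```

This is just the one-coefficient special case; the rest of the section generalizes the same statistic to a difference of two coefficients.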
00:02:34.813 --> 00:02:37.370
To define the estimate,

00:02:37.370 --> 00:02:40.980
we will now write the null hypothesis in a slightly different way.

00:02:41.494 --> 00:02:43.144
So we'll write it this way:

00:02:43.144 --> 00:02:45.900
if type professional equals type white collar,

00:02:45.900 --> 00:02:48.570
then type professional minus type white collar

00:02:48.570 --> 00:02:49.849
equals zero.

00:02:50.000 --> 00:02:51.920
So we have something here

00:02:51.920 --> 00:02:54.600
that we compare against zero in the population.

00:02:54.901 --> 00:02:56.611
So this is our estimate:

00:02:56.611 --> 00:03:00.060
the estimated difference between type professional and type white collar,

00:03:00.060 --> 00:03:02.970
and then we raise it to the second power.

00:03:03.502 --> 00:03:05.242
So that's easy enough.

00:03:05.242 --> 00:03:07.080
How about the standard error squared?

00:03:07.381 --> 00:03:08.791
We have to understand

00:03:08.791 --> 00:03:10.333
what the standard error quantifies.

00:03:10.635 --> 00:03:13.155
The standard error

00:03:13.155 --> 00:03:17.550
is the estimate of the standard deviation of this estimate

00:03:17.550 --> 00:03:21.720
if we repeated the same random sampling over and over

00:03:21.720 --> 00:03:22.871
from the same population.

00:03:22.871 --> 00:03:29.434
So it tells how much this estimate varies because of sampling fluctuations.

00:03:30.780 --> 00:03:36.217
In our case, the standard error squared

00:03:36.217 --> 00:03:38.700
is the estimated standard deviation squared,

00:03:39.161 --> 00:03:42.221
and standard deviation squared is the same as the variance.

00:03:42.221 --> 00:03:48.270
So we have the estimate squared divided by the variance of the estimate.

00:03:48.837 --> 00:03:51.957
So how do we calculate the variance of the estimate now?

00:03:51.957 --> 00:03:53.173
We have the estimate,

00:03:53.173 --> 00:03:56.113
which is type professional minus type white collar.
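The description of the standard error above — the standard deviation of the estimate over repeated random samples — can be illustrated with a small simulation. This is a hedged sketch on a hypothetical population (a simple sample mean, not the prestige regression), showing that the empirical spread of the estimate across many samples matches the analytic standard error:

```python
import math
import random

random.seed(1)

# Hypothetical population: standard normal, n = 50 per sample.
# For a sample mean, the analytic standard error is sd / sqrt(n) = 1 / sqrt(50).
n = 50
analytic_se = 1 / math.sqrt(n)

# Repeat the same random sampling many times, computing the estimate each time.
estimates = []
for _ in range(5000):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    estimates.append(sum(sample) / n)

# Empirical standard deviation of the estimate over repeated samples
mean_est = sum(estimates) / len(estimates)
empirical_sd = math.sqrt(
    sum((e - mean_est) ** 2 for e in estimates) / (len(estimates) - 1)
)

# The two numbers should be close (about 0.14 each)
print(round(analytic_se, 3), round(empirical_sd, 3))
```

The same logic applies to a regression coefficient, or to a difference of two coefficients: the standard error estimates how much that quantity would fluctuate across repeated samples.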
00:03:56.219 --> 00:03:57.920
We can plug in these numbers,

00:03:57.920 --> 00:03:59.524
we get about 10,

00:03:59.524 --> 00:04:01.650
and we raise it to the second power,

00:04:01.650 --> 00:04:03.403
we get about 100,

00:04:03.403 --> 00:04:06.960
and then we divide it by the variance of that estimate.

00:04:06.960 --> 00:04:08.491
But how do we do that?

00:04:09.537 --> 00:04:10.950
We need this kind of equation.

00:04:10.950 --> 00:04:13.350
So that's the estimate; that's easy enough.

00:04:13.350 --> 00:04:17.610
And when we have the difference between two variables,

00:04:17.610 --> 00:04:19.530
type professional and type white collar,

00:04:19.530 --> 00:04:20.550
they both vary.

00:04:20.550 --> 00:04:23.730
Then the variance of this difference is

00:04:23.730 --> 00:04:27.150
the sum of the variances of the two variables

00:04:27.150 --> 00:04:30.990
minus two times the covariance between these two variables.

00:04:31.681 --> 00:04:35.400
You can check the covariance calculation rules in this Wikipedia link,

00:04:35.683 --> 00:04:38.940
or your favorite regression book, if it's a good book,

00:04:38.940 --> 00:04:41.909
will also explain how covariances are calculated.

00:04:42.777 --> 00:04:45.750
So we know the type professional variance

00:04:45.750 --> 00:04:48.270
and the type white collar variance;

00:04:48.270 --> 00:04:49.350
those are the standard errors squared.

00:04:50.000 --> 00:04:51.995
But what's this term here,

00:04:51.995 --> 00:04:53.790
this covariance between the estimates?

00:04:54.251 --> 00:05:00.030
We can think of the covariance between these two estimates as follows:

00:05:00.030 --> 00:05:06.778
what will happen if the blue-collar occupations,

00:05:06.778 --> 00:05:10.000
which we use as the reference category,

00:05:10.000 --> 00:05:13.472
what if the prestige of those is a bit lower?
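The variance rule just stated, Var(a − b) = Var(a) + Var(b) − 2·Cov(a, b), can be sketched numerically. The numbers below are hypothetical stand-ins for entries of a coefficient covariance matrix, not the actual prestige-model values:

```python
# Hypothetical entries of a coefficient covariance matrix (illustration only):
var_prof = 5.0       # variance of the type professional coefficient (= SE squared)
var_wc = 6.0         # variance of the type white collar coefficient (= SE squared)
cov_prof_wc = 1.5    # covariance: positive, since both share the blue-collar reference

# Variance of the difference between the two estimates
var_diff = var_prof + var_wc - 2 * cov_prof_wc
print(var_diff)  # 8.0
```

Note that ignoring the covariance term and just summing the two variances would overstate the variance of the difference here, because a positive covariance makes the difference more stable than the individual estimates suggest.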
00:05:13.472 --> 00:05:17.719
So if the blue-collar occupations' prestige is a bit lower,

00:05:17.719 --> 00:05:21.433
it means that both type professional and type white collar,

00:05:21.433 --> 00:05:26.100
which are evaluated against the blue-collar prestige,

00:05:26.100 --> 00:05:27.806
both increase a bit.

00:05:28.160 --> 00:05:36.390
So when these two estimates vary over repeated samples,

00:05:36.390 --> 00:05:39.109
they will also covary.

00:05:39.534 --> 00:05:43.740
So they will be correlated in repeated samples most of the time.

00:05:45.335 --> 00:05:48.570
The covariance matrix of the estimates

00:05:48.570 --> 00:05:51.570
is something that the regression analysis will provide for you.

00:05:51.818 --> 00:05:59.299
And here is the covariance matrix of the estimates for our example.

00:05:59.299 --> 00:06:03.504
So the square root of this variance here is the standard error;

00:06:03.504 --> 00:06:05.850
you can verify that with your hand calculator.

00:06:05.850 --> 00:06:12.720
And the square root of this variance here is the standard error for type white collar.

00:06:13.181 --> 00:06:15.971
And here's the covariance between these two estimates.

00:06:15.971 --> 00:06:19.530
So this is something that the regression analysis software provides for you.

00:06:19.530 --> 00:06:21.630
You don't have to understand how it's calculated.

00:06:22.073 --> 00:06:25.013
Then we take the numbers here,

00:06:25.013 --> 00:06:26.820
we plug them into this equation,

00:06:26.820 --> 00:06:31.804
and we get an answer of 12.325.
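Putting the pieces together, the whole calculation walked through in this section looks like the following sketch. The coefficient estimates and covariance entries are hypothetical (chosen to be roughly in line with the numbers quoted in the lecture), so this will not reproduce the exact 12.325 from the actual prestige model:

```python
# Hypothetical coefficient estimates (illustration only, not the real fit):
b_prof = 7.0     # type professional vs. blue collar
b_wc = -3.0      # type white collar vs. blue collar

# Hypothetical block of the coefficient covariance matrix:
var_prof = 5.0
var_wc = 6.0
cov_prof_wc = 1.5

# Estimate for the null hypothesis b_prof - b_wc = 0
diff = b_prof - b_wc                            # about 10
var_diff = var_prof + var_wc - 2 * cov_prof_wc  # variance of the difference

# Wald statistic: estimate squared divided by its variance
wald = diff ** 2 / var_diff
print(wald)  # 12.5
```

With real software you would read the variances and the covariance out of the fitted model's coefficient covariance matrix instead of typing them in by hand.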
00:06:32.832 --> 00:06:38.220
We compare that 12.325 against the chi-square distribution

00:06:38.220 --> 00:06:39.778
with one degree of freedom,

00:06:39.778 --> 00:06:42.527
or we compare it against the proper F-distribution,

00:06:42.527 --> 00:06:45.000
because this is regression analysis

00:06:45.000 --> 00:06:49.320
and we know how regression analysis behaves in small samples;

00:06:49.320 --> 00:06:52.020
if we didn't, we would use the chi-square distribution.

00:06:52.303 --> 00:06:56.100
So whether you use the F-distribution or the chi-square to compare this against

00:06:56.100 --> 00:06:59.100
depends on the same consideration as

00:06:59.100 --> 00:07:02.280
whether you would be using a z test or a t test.

00:07:02.581 --> 00:07:06.990
If you are using statistics that have only been proven in large samples,

00:07:06.990 --> 00:07:08.760
then you use the z test and the chi-square.

00:07:08.760 --> 00:07:10.800
If you use statistics where we know

00:07:10.800 --> 00:07:12.330
how they behave in small samples,

00:07:12.330 --> 00:07:14.700
then you use a t test and an F-test.

00:07:16.050 --> 00:07:18.360
But you don't have to check that from your statistics book,

00:07:18.360 --> 00:07:21.960
because your computer software will do all this calculation for you.

00:07:22.385 --> 00:07:26.100
So in R, we can just use the linearHypothesis function,

00:07:26.100 --> 00:07:28.980
and then we specify the hypothesis here.

00:07:28.980 --> 00:07:32.130
R will calculate the test statistic for you,

00:07:32.130 --> 00:07:35.700
12.325, which is the same we got here manually,

00:07:35.700 --> 00:07:40.890
and it'll give you the proper p-value against the proper F-distribution.

00:07:40.890 --> 00:07:43.170
So this is a highly significant difference.

00:07:43.702 --> 00:07:46.800
This kind of comparison is not restricted to

00:07:46.800 --> 00:07:50.420
comparing two categories of a categorical variable.
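As a rough sketch of the final comparison: converting the quoted statistic of 12.325 to a chi-square p-value with one degree of freedom needs nothing beyond the standard library. The F-based p-value that software such as R's linearHypothesis reports will differ slightly, since the F-distribution also accounts for the residual degrees of freedom:

```python
import math

wald = 12.325  # the test statistic from the lecture example

# Chi-square with 1 df: the survival function has the closed form
# erfc(sqrt(x / 2)), equivalent to the two-sided normal tail of z = sqrt(x).
p_chisq = math.erfc(math.sqrt(wald / 2))

print(p_chisq < 0.001)  # True: a highly significant difference
```

In large samples the chi-square and F-based p-values converge, which is exactly the z-test versus t-test consideration described above.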
00:07:50.420 --> 00:07:52.380
You can also do other comparisons,

00:07:52.380 --> 00:07:58.078
for example, whether the effects of women and education are the same,

00:07:58.078 --> 00:08:02.337
or whether the effect of education is different from, let's say, five.

00:08:02.337 --> 00:08:06.030
But comparing two regression coefficients comes with a big caveat.

00:08:06.526 --> 00:08:08.386
It only makes sense

00:08:08.386 --> 00:08:12.900
if those two regression coefficients quantify the effects of two variables

00:08:12.900 --> 00:08:15.000
that are somehow comparable.

00:08:15.443 --> 00:08:17.453
So you can't really compare

00:08:17.772 --> 00:08:21.300
a number of years of education to a share of women;

00:08:21.300 --> 00:08:22.890
those are incomparable.

00:08:23.262 --> 00:08:28.200
In many cases, these kinds of comparisons don't make much sense.

00:08:28.643 --> 00:08:32.370
Here, because we have a categorical variable

00:08:32.370 --> 00:08:33.390
with different categories,

00:08:33.390 --> 00:08:34.556
they are comparable.

00:08:34.556 --> 00:08:37.080
These are categories of the same variable,

00:08:37.080 --> 00:08:38.723
so it makes sense to compare them;

00:08:38.940 --> 00:08:41.430
in some other scenarios, it doesn't.

00:08:41.430 --> 00:08:42.686
So you really have to think about

00:08:42.774 --> 00:08:45.084
whether the comparison makes sense

00:08:45.084 --> 00:08:47.280
before you do this kind of statistical test.

00:08:47.280 --> 00:08:50.550
Because your statistical software will do any test for you,

00:08:50.550 --> 00:08:53.460
it will not tell you whether the test makes sense;

00:08:53.460 --> 00:08:54.660
you have to think for yourself.