WEBVTT
Kind: captions
Language: en

00:00:01.306 --> 00:00:05.000
Regression analysis will give you 
estimates of regression coefficients

00:00:05.000 --> 00:00:08.381
and statistical tests of whether 
those coefficients are

00:00:08.381 --> 00:00:10.200
different from zero in the population.

00:00:10.820 --> 00:00:15.780
Sometimes, however, it is very useful 
to be able to test other hypotheses.

00:00:15.780 --> 00:00:20.809
For example, whether a coefficient 
differs from a value other than 0,

00:00:20.809 --> 00:00:24.188
or whether two coefficients are 
the same in the population.

00:00:24.507 --> 00:00:26.520
To do that, we need to understand

00:00:26.520 --> 00:00:30.240
how we test a linear hypothesis 
after regression analysis.

00:00:31.090 --> 00:00:33.520
So let's take an example of

00:00:33.520 --> 00:00:36.422
a regression of prestige

00:00:36.422 --> 00:00:39.420
on education, women, and type of occupation,

00:00:39.420 --> 00:00:42.342
using the prestige data that 
we have been using before.

00:00:42.661 --> 00:00:44.700
So we get some regression estimates and

00:00:44.700 --> 00:00:47.276
we'll be focusing on these dummy variables.

00:00:47.276 --> 00:00:51.300
So the effects of professional 
and white collar here tell us,

00:00:51.300 --> 00:00:53.007
what is the difference between,

00:00:53.007 --> 00:00:57.870
or the expected difference between professional 
occupations and blue collar occupations,

00:00:57.870 --> 00:01:00.930
and between white collar occupations 
and blue collar occupations.

00:01:01.213 --> 00:01:04.603
So the regression coefficients 
here are differences

00:01:04.603 --> 00:01:06.906
related to a reference category,

00:01:06.906 --> 00:01:08.589
which is blue collar.

00:01:09.174 --> 00:01:13.665
However, sometimes knowing the difference

00:01:13.665 --> 00:01:18.420
between the categories and a 
reference category is not enough.

00:01:18.774 --> 00:01:20.664
What if we wanted to know,

00:01:20.664 --> 00:01:23.670
what's the difference between 
professional and white collar,

00:01:23.670 --> 00:01:26.370
and is that statistically significant?

00:01:26.583 --> 00:01:27.843
The difference between

00:01:27.860 --> 00:01:29.660
professional and white collar occupations

00:01:29.660 --> 00:01:31.748
is simply the sum of these two estimates,

00:01:31.748 --> 00:01:33.945
so it's about 10,

00:01:33.945 --> 00:01:36.930
but is that difference statistically significant?

00:01:36.930 --> 00:01:39.428
So we need to get a p-value.

00:01:39.747 --> 00:01:44.070
We can see that the p-value for professionals

00:01:44.070 --> 00:01:49.895
is about 0.08 for an estimate of 7,

00:01:49.895 --> 00:01:51.695
and based on that,

00:01:51.695 --> 00:01:55.770
considering that the difference between 
professionals and blue collars is 10,

00:01:55.770 --> 00:02:00.600
we could conclude that maybe the 
difference of 10 is significant,

00:02:00.600 --> 00:02:02.910
given that a difference of 7 is close to significant.

00:02:02.910 --> 00:02:06.390
However, we need to do a proper test to assess

00:02:06.390 --> 00:02:07.387
whether that's the case.

00:02:08.167 --> 00:02:10.920
To do that we use the Wald test.

00:02:11.841 --> 00:02:14.070
And here the Wald test,

00:02:14.070 --> 00:02:17.124
the null hypothesis that I have in mind is that

00:02:17.124 --> 00:02:21.120
the type professional coefficient 
is the same as the type white collar coefficient.

00:02:21.634 --> 00:02:24.304
To calculate the Wald test,

00:02:24.304 --> 00:02:29.340
we take the estimate squared and 
divide it by the standard error squared.

00:02:29.872 --> 00:02:31.058
So how do we do that?

00:02:31.058 --> 00:02:32.884
We have to define what the estimate is here.

00:02:33.043 --> 00:02:34.813
And what is the standard error here?

00:02:34.813 --> 00:02:37.370
To define the estimate,

00:02:37.370 --> 00:02:40.980
we will now write the null hypothesis 
in a slightly different way.

00:02:41.494 --> 00:02:43.144
So we'll write it that way.

00:02:43.144 --> 00:02:45.900
So if type professional equals type white collar,

00:02:45.900 --> 00:02:48.570
then type professional minus type white collar

00:02:48.570 --> 00:02:49.849
equals zero.
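
Collecting the steps just described, the rewritten null hypothesis and the Wald statistic it leads to can be sketched in notation as follows (prof and wc abbreviate the two coefficients; the hats denote sample estimates):

```latex
H_0:\; \beta_{\mathrm{prof}} = \beta_{\mathrm{wc}}
\quad\Longleftrightarrow\quad
\beta_{\mathrm{prof}} - \beta_{\mathrm{wc}} = 0,
\qquad
W = \frac{\left(\hat\beta_{\mathrm{prof}} - \hat\beta_{\mathrm{wc}}\right)^{2}}
         {\widehat{\operatorname{Var}}\!\left(\hat\beta_{\mathrm{prof}} - \hat\beta_{\mathrm{wc}}\right)}
```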

00:02:50.000 --> 00:02:51.920
So we have something here,

00:02:51.920 --> 00:02:54.600
that we compare against zero in the population.

00:02:54.901 --> 00:02:56.611
So this is our estimate,

00:02:56.611 --> 00:03:00.060
what is the estimated difference between 
type professional and type white collar,

00:03:00.060 --> 00:03:02.970
and then we raise it to the second power.

00:03:03.502 --> 00:03:05.242
So that's easy enough.

00:03:05.242 --> 00:03:07.080
How about the standard error squared?

00:03:07.381 --> 00:03:08.791
We have to understand,

00:03:08.791 --> 00:03:10.333
what does the standard error quantify?

00:03:10.635 --> 00:03:13.155
So the standard error quantifies,

00:03:13.155 --> 00:03:17.550
the estimated standard 
deviation of this estimate,

00:03:17.550 --> 00:03:21.720
if we repeated the same 
random sample over and over

00:03:21.720 --> 00:03:22.871
from the same population.

00:03:22.871 --> 00:03:29.434
That is, how much this estimate varies 
because of sampling fluctuations.

00:03:30.780 --> 00:03:36.217
In our case the standard error squared

00:03:36.217 --> 00:03:38.700
is the estimated standard deviation squared,

00:03:39.161 --> 00:03:42.221
and standard deviation squared 
is the same as the variance.

00:03:42.221 --> 00:03:48.270
So we have the estimate squared 
divided by the variance of the estimate.

00:03:48.837 --> 00:03:51.957
So how do we calculate the 
variance of the estimate now?

00:03:51.957 --> 00:03:53.173
We have the estimate,

00:03:53.173 --> 00:03:56.113
which is type professional 
minus type white collar.

00:03:56.219 --> 00:03:57.920
We can plug in these numbers,

00:03:57.920 --> 00:03:59.524
we get about 10,

00:03:59.524 --> 00:04:01.650
and we raise it to the second power,

00:04:01.650 --> 00:04:03.403
we get about 100,

00:04:03.403 --> 00:04:06.960
and then we divide it by the 
variance of that estimate.

00:04:06.960 --> 00:04:08.491
But how do we do that?

00:04:09.537 --> 00:04:10.950
We need this kind of equation,

00:04:10.950 --> 00:04:13.350
so that's the estimate, that's easy enough.

00:04:13.350 --> 00:04:17.610
And when we have the difference 
between two variables,

00:04:17.610 --> 00:04:19.530
type professional and type white collar,

00:04:19.530 --> 00:04:20.550
they both vary.

00:04:20.550 --> 00:04:23.730
Then the variance of this difference is

00:04:23.730 --> 00:04:27.150
the variance of both variables summed,

00:04:27.150 --> 00:04:30.990
minus two times the covariance 
between these two variables.

00:04:31.681 --> 00:04:35.400
You can check the covariance 
calculation rule in this Wikipedia link,

00:04:35.683 --> 00:04:38.940
or your favorite regression 
book, which, if it's a good book,

00:04:38.940 --> 00:04:41.909
will also explain how covariances are calculated.

00:04:42.777 --> 00:04:45.750
So we know the type professional variance

00:04:45.750 --> 00:04:48.270
and the type white collar variance,

00:04:48.270 --> 00:04:49.350
those are the standard errors squared.

00:04:50.000 --> 00:04:51.995
But what's this term here,

00:04:51.995 --> 00:04:53.790
the covariance between the estimates?

00:04:54.251 --> 00:05:00.030
We can think of the covariance 
between these two estimates as follows:

00:05:00.030 --> 00:05:06.778
what will happen with the blue collar occupations,

00:05:06.778 --> 00:05:10.000
that we use as a reference category.

00:05:10.000 --> 00:05:13.472
What if the prestige of those is a bit lower?

00:05:13.472 --> 00:05:17.719
So if the blue collar occupations' 
prestige is a bit lower,

00:05:17.719 --> 00:05:21.433
it means that both type 
professional and type white collar,

00:05:21.433 --> 00:05:26.100
which are evaluated against the 
blue collar occupations' prestige,

00:05:26.100 --> 00:05:27.806
both increase a bit.

00:05:28.160 --> 00:05:36.390
So when these two estimates 
vary over repeated samples,

00:05:36.390 --> 00:05:39.109
then they will also covary.

00:05:39.534 --> 00:05:43.740
So they will be correlated in 
repeated samples most of the time.

00:05:45.335 --> 00:05:48.570
The covariance matrix of the estimates

00:05:48.570 --> 00:05:51.570
is something that the regression 
analysis will provide for you.

00:05:51.818 --> 00:05:59.299
And here is the covariance matrix 
for the estimates for our example.

00:05:59.299 --> 00:06:03.504
So the square root of this variance here 
is the standard error for type professional,

00:06:03.504 --> 00:06:05.850
as you can verify with a hand calculator.

00:06:05.850 --> 00:06:12.720
And the square root of this variance here is 
the standard error for type white collar.

00:06:13.181 --> 00:06:15.971
And here's the covariance 
between these two estimates.

00:06:15.971 --> 00:06:19.530
So this is something that the regression 
analysis software provides for you.

00:06:19.530 --> 00:06:21.630
You don't have to understand how it's calculated.

00:06:22.073 --> 00:06:25.013
Then we take the numbers here,

00:06:25.013 --> 00:06:26.820
we plug them into this equation,

00:06:26.820 --> 00:06:31.804
and we get an answer of 12.325.
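
The whole calculation can be sketched in a few lines. The coefficients, variances, and covariance below are hypothetical stand-ins of roughly the right magnitude, not the actual values from the Prestige regression, so the statistic comes out near, but not exactly at, the 12.325 of the example:

```python
import math

# Hypothetical stand-in values (not the real Prestige estimates).
b_prof = 7.0     # coefficient: professional vs. blue collar
b_wc = -3.0      # coefficient: white collar vs. blue collar
var_prof = 16.0  # variance of b_prof (standard error squared)
var_wc = 4.0     # variance of b_wc (standard error squared)
cov = 6.0        # covariance between the two estimates

estimate = b_prof - b_wc                # the difference we test against 0
var_diff = var_prof + var_wc - 2 * cov  # variance of that difference
wald = estimate**2 / var_diff           # Wald statistic (12.5 here)

# Compare against the chi-square distribution with 1 degree of freedom:
# for 1 df the survival function is erfc(sqrt(W / 2)), so the p-value
# needs no external library.
p_value = math.erfc(math.sqrt(wald / 2))

print(wald, p_value)
```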

00:06:32.832 --> 00:06:38.220
We compare that 12.325 against 
the chi-square distribution

00:06:38.220 --> 00:06:39.778
with one degree of freedom,

00:06:39.778 --> 00:06:42.527
or we compare it against 
the proper F-distribution,

00:06:42.527 --> 00:06:45.000
because this is regression analysis

00:06:45.000 --> 00:06:49.320
and we know how regression analysis 
behaves in small samples;

00:06:49.320 --> 00:06:52.020
if we didn't, we would use 
the chi-square distribution.

00:06:52.303 --> 00:06:56.100
So whether you use the F-distribution or 
the chi-square to compare this against,

00:06:56.100 --> 00:06:59.100
depends on the same consideration as

00:06:59.100 --> 00:07:02.280
whether you would be using a z test or a t test.

00:07:02.581 --> 00:07:06.990
If you are using statistics whose properties 
have only been established in large samples,

00:07:06.990 --> 00:07:08.760
then you use the z test and chi-square.

00:07:08.760 --> 00:07:10.800
If you use statistics that we know

00:07:10.800 --> 00:07:12.330
how they behave in small samples,

00:07:12.330 --> 00:07:14.700
then you use a t test and an F-test.
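
The z/chi-square pairing can be illustrated numerically: the square of a standard normal z statistic follows a chi-square distribution with one degree of freedom, so the two tests give the same p-value. A small sketch, using the familiar two-sided 5% critical value of 1.96 (the same relationship links t² to F with one numerator degree of freedom):

```python
import math

z = 1.96           # two-sided 5% critical value of the standard normal
chi2_stat = z**2   # the corresponding chi-square(1 df) statistic

p_z = math.erfc(abs(z) / math.sqrt(2))        # two-sided z-test p-value
p_chi2 = math.erfc(math.sqrt(chi2_stat / 2))  # chi-square(1 df) p-value

print(p_z, p_chi2)  # both are about 0.05
```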

00:07:16.050 --> 00:07:18.360
But you don't have to check 
that from your statistics book,

00:07:18.360 --> 00:07:21.960
because your computer software will 
do all this calculation for you.

00:07:22.385 --> 00:07:26.100
So in R, we can just use the linearHypothesis function

00:07:26.100 --> 00:07:28.980
and then we specify the hypothesis here,

00:07:28.980 --> 00:07:32.130
and R will calculate the test statistic for you,

00:07:32.130 --> 00:07:35.700
12.325, which is the same as we got manually,

00:07:35.700 --> 00:07:40.890
and it'll give you the proper p-value 
against the proper F-distribution.

00:07:40.890 --> 00:07:43.170
So this is a highly significant difference.

00:07:43.702 --> 00:07:46.800
This kind of comparison is not restricted to

00:07:46.800 --> 00:07:50.420
comparing two categories 
of a categorical variable.

00:07:50.420 --> 00:07:52.380
You can also do other kinds of comparisons,

00:07:52.380 --> 00:07:58.078
for example, whether the effects of 
women and education are the same,

00:07:58.078 --> 00:08:02.337
or whether the effect of education 
is different from, let's say, five.

00:08:02.337 --> 00:08:06.030
But comparing two regression 
coefficients comes with a big caveat.

00:08:06.526 --> 00:08:08.386
It only makes sense,

00:08:08.386 --> 00:08:12.900
if those two regression coefficients 
quantify the effects of two variables

00:08:12.900 --> 00:08:15.000
that are somehow comparable.

00:08:15.443 --> 00:08:17.453
So you can't really compare

00:08:17.772 --> 00:08:21.300
the number of years of education to the share of women,

00:08:21.300 --> 00:08:22.890
so those are incomparable.

00:08:23.262 --> 00:08:28.200
In many cases, these kinds of 
comparisons don't make much sense.

00:08:28.643 --> 00:08:32.370
Here, because we have a categorical variable

00:08:32.370 --> 00:08:33.390
with different categories,

00:08:33.390 --> 00:08:34.556
they are comparable.

00:08:34.556 --> 00:08:37.080
So these are categories of the same variable.

00:08:37.080 --> 00:08:38.723
Here it makes sense to compare;

00:08:38.940 --> 00:08:41.430
in some other scenarios, it doesn't.

00:08:41.430 --> 00:08:42.686
So you really have to think,

00:08:42.774 --> 00:08:45.084
does the comparison make sense,

00:08:45.084 --> 00:08:47.280
before you can do this kind of statistical test.

00:08:47.280 --> 00:08:50.550
Your statistical software 
will do any test for you;

00:08:50.550 --> 00:08:53.460
it will not tell you whether the test makes sense,

00:08:53.460 --> 00:08:54.660
you have to think for yourself.