WEBVTT

00:00:00.090 --> 00:00:03.120
We will now formalize the 
previous example a bit more,

00:00:03.120 --> 00:00:10.080
and we will discuss the concept of 
null hypothesis significance testing,

00:00:10.080 --> 00:00:12.870
or NHST for short.

00:00:12.870 --> 00:00:18.000
The idea of null hypothesis 
significance testing is that,

00:00:18.690 --> 00:00:20.970
we start from

00:00:20.970 --> 00:00:26.130
some kind of estimation problem
that gives us an estimate,

00:00:26.130 --> 00:00:28.590
and we define two things,

00:00:28.590 --> 00:00:33.510
we define a test statistic and 
then we define a null hypothesis.

00:00:33.510 --> 00:00:35.700
So we have the test statistic,

00:00:35.700 --> 00:00:37.410
which we refer to as T,

00:00:37.410 --> 00:00:45.330
and then we need the sampling
distribution of T under the null hypothesis.

00:00:45.330 --> 00:00:50.190
The null hypothesis or H0 is typically

00:00:50.190 --> 00:00:52.950
a hypothesis that there is no effect,

00:00:52.950 --> 00:00:57.570
there is no correlation between 
CEO gender and profitability,

00:00:57.570 --> 00:01:02.220
or there is no difference between men 
and women-led companies on profitability.

00:01:02.220 --> 00:01:06.630
Then we derive, based on statistical theory,

00:01:06.630 --> 00:01:08.670
a reference distribution,

00:01:08.670 --> 00:01:14.370
so how would the test statistic be 
distributed if there was really no effect?

00:01:14.370 --> 00:01:22.590
Then we compare the test statistic calculated 
from our sample to the distribution,

00:01:22.590 --> 00:01:27.270
and we can see that, okay this 
area here gives us the p-value.

00:01:27.270 --> 00:01:31.440
So it is the probability of
obtaining a test statistic

00:01:31.440 --> 00:01:34.500
at least this extreme under the null hypothesis.

00:01:34.500 --> 00:01:38.220
So comparing the observed statistic to
the distribution gives us the p-value.

00:01:38.220 --> 00:01:40.680
So that's the idea of null 
hypothesis significance testing.

00:01:40.680 --> 00:01:43.890
Typically this is done by a computer for you,

00:01:43.890 --> 00:01:47.070
so you don't have to draw this normal 
distribution or calculate the area,

00:01:47.070 --> 00:01:50.940
but it's useful to understand 
what's going on under the hood,

00:01:50.940 --> 00:01:56.070
so you know what kind of problems we 
face when we do this kind of inference.

00:01:56.070 --> 00:02:01.530
Perhaps the simplest test using
null hypothesis significance testing

00:02:01.530 --> 00:02:04.080
is the t test,

00:02:04.080 --> 00:02:10.920
and the idea of a t test is that 
it assumes that the estimates

00:02:10.920 --> 00:02:13.920
are normally distributed over repeated samples.

00:02:13.920 --> 00:02:17.160
This was the case when we compared two means,

00:02:17.160 --> 00:02:21.420
so the difference of two means
is normally distributed

00:02:21.420 --> 00:02:23.022
when the sample size is large enough.

00:02:23.022 --> 00:02:29.190
And once we have the estimate,

00:02:29.190 --> 00:02:34.876
then the test statistic is: estimate 
divided by the standard error.
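
As a small sketch of this formula (the numbers below are hypothetical, not from the lecture), the t statistic is just the estimate divided by its standard error:

```python
import math

# Hypothetical profitability figures for two groups of companies
group_a = [5.1, 4.8, 6.2, 5.9, 5.4, 6.0, 5.7, 5.2]
group_b = [4.2, 4.9, 4.4, 5.1, 4.0, 4.6, 4.8, 4.3]

def sample_var(xs):
    """Sample variance with the n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# The estimate: difference of the two group means
estimate = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Standard error of the difference (simple unpooled formula)
se = math.sqrt(sample_var(group_a) / len(group_a)
               + sample_var(group_b) / len(group_b))

# The t statistic: how many standard errors the estimate is from zero
t = estimate / se
print(round(t, 2))
```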

00:02:35.289 --> 00:02:36.660
So instead of looking at

00:02:36.660 --> 00:02:40.440
how far the estimate is from the 
null hypothesis value of zero,

00:02:40.440 --> 00:02:46.200
we look at how far the estimate divided
by the standard error is from zero.

00:02:47.730 --> 00:02:51.510
And this follows Student's t distribution,

00:02:51.510 --> 00:02:53.040
that looks like a normal distribution,

00:02:53.040 --> 00:02:56.700
but it's a bit wider in small samples.

00:02:56.700 --> 00:03:02.370
The idea of a t test or this estimate 
divided by the standard error,

00:03:02.370 --> 00:03:05.190
is that we standardize the estimate.

00:03:05.190 --> 00:03:13.920
So remember, standardization means
subtracting the mean of the estimates,

00:03:13.920 --> 00:03:17.310
so here we assume the mean 
to be the null hypothesis,

00:03:17.310 --> 00:03:20.310
so we subtract zero, which
doesn't really make a difference,

00:03:20.310 --> 00:03:23.100
and we divide by standard deviation,

00:03:23.100 --> 00:03:27.030
which in this case is estimated by the standard error.

00:03:27.030 --> 00:03:29.970
So the t statistic tells us,

00:03:29.970 --> 00:03:35.550
how far from zero the estimate 
is, on a standardized metric.

00:03:35.550 --> 00:03:40.800
If it's more than two 
standard deviations from zero,

00:03:40.800 --> 00:03:42.750
then we conclude that,

00:03:42.750 --> 00:03:47.460
that kind of observation would be
unlikely to occur by chance alone,

00:03:47.460 --> 00:03:54.750
because 95% of the observations fall within 
plus or minus two standard deviations,

00:03:54.750 --> 00:04:00.937
when we have a normally distributed statistic.
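
The 95% figure can be checked from the standard normal distribution; a minimal sketch:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability mass within +/- 1.96 standard deviations of zero
inside = phi(1.96) - phi(-1.96)
print(round(inside, 3))  # close to 0.95
```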

00:04:00.937 --> 00:04:03.180
So, we compare this area,

00:04:03.180 --> 00:04:07.410
in practice, it often makes 
sense to compare both areas here,

00:04:07.410 --> 00:04:09.150
so we calculate this area as well.

00:04:09.150 --> 00:04:10.740
The logic being that,

00:04:11.250 --> 00:04:16.140
it would be an important finding if the
difference went in the other direction as well.

00:04:16.140 --> 00:04:21.300
And this relates to what is referred
to as one- and two-tailed tests.

00:04:21.300 --> 00:04:23.580
So, which area do we compare?

00:04:23.580 --> 00:04:28.820
So normally, if we only compare 
one end of the normal distribution here,

00:04:28.820 --> 00:04:30.800
this is called a one-tailed test.

00:04:30.800 --> 00:04:34.760
And if we compare both tails,

00:04:34.760 --> 00:04:38.270
so that the five percent
area is here and here together,

00:04:38.270 --> 00:04:43.190
So this is 2.5 % and this 
is 2.5 % so they sum to 5 %,

00:04:43.190 --> 00:04:45.529
then that's called a two-tailed test.
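
The two kinds of tail areas can be sketched like this, using the normal approximation to the t distribution (the observed statistic of 2.3 is a made-up example):

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

t_observed = 2.3  # hypothetical test statistic from a sample

# One-tailed p-value: area beyond the statistic in one tail only
p_one = 1.0 - phi(t_observed)

# Two-tailed p-value: area beyond |t| in both tails together
p_two = 2.0 * (1.0 - phi(abs(t_observed)))

# The two-tailed p-value is exactly twice the one-tailed one
print(round(p_one, 4), round(p_two, 4))
```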

00:04:46.396 --> 00:04:52.400
Normally, when your statistical software 
gives you a p-value from a t test,

00:04:52.400 --> 00:04:56.210
or some other test that uses something 
that looks like a normal distribution,

00:04:56.210 --> 00:04:58.280
for example a z test,

00:04:58.280 --> 00:05:01.820
then it is two-tailed, so you compare both ends.

00:05:01.820 --> 00:05:06.320
And it's considered cheating 
to use the one-tailed test,

00:05:06.320 --> 00:05:09.530
because what the one-tailed test basically does,

00:05:09.530 --> 00:05:16.730
is give you a p-value that is exactly
half of the two-tailed p-value.

00:05:16.730 --> 00:05:18.680
Because you have two tails here,

00:05:18.680 --> 00:05:21.350
the probability of the observation falling in

00:05:21.350 --> 00:05:24.470
either tail is twice the probability
of it falling in one tail,

00:05:24.470 --> 00:05:28.400
so the probability here is half of
what the probability of both tails would be.

00:05:28.400 --> 00:05:32.090
The problem in one-tailed tests is that,

00:05:32.090 --> 00:05:39.770
the standard is to use two tails and if 
we observe a p-value in a research paper,

00:05:39.770 --> 00:05:45.410
we assume that it was computed
using a two-tailed test.

00:05:45.410 --> 00:05:54.350
Sometimes if the p-value is 0.06 and a 
researcher wants it to be less than 0.05,

00:05:54.350 --> 00:05:56.480
they switch to a one-tailed test,

00:05:56.480 --> 00:05:58.940
which cuts the p-value in half,

00:05:58.940 --> 00:06:05.900
and they present the results as if they
came from a two-tailed test,

00:06:05.900 --> 00:06:08.960
which misleads readers and is unethical.

00:06:10.250 --> 00:06:15.320
There are basically no good reasons 
ever to use these one-tailed tests,

00:06:15.320 --> 00:06:18.890
because the two-tailed test is more commonly accepted, and also,

00:06:18.890 --> 00:06:22.550
if someone wants to have the one-tailed 
test instead of a two-tailed test,

00:06:22.550 --> 00:06:25.100
they can just divide your p-values by two

00:06:25.100 --> 00:06:27.965
since that is the only difference.

00:06:27.965 --> 00:06:32.480
The p-values are very commonly 
used in research papers.

00:06:32.480 --> 00:06:35.690
So you see papers, for example,

00:06:35.690 --> 00:06:37.160
this is from Hekman's paper,

00:06:37.160 --> 00:06:40.940
you see these p-values attached to statistics,

00:06:40.940 --> 00:06:42.740
so you see a regression estimate here,

00:06:42.740 --> 00:06:45.290
and then there is p-value less than 0.01,

00:06:45.290 --> 00:06:47.300
that is statistically significant,

00:06:47.300 --> 00:06:50.630
you see this n.s., which means non-significant, or

00:06:50.630 --> 00:06:53.630
you can see p-value is greater than 0.05.

00:06:53.630 --> 00:07:00.860
So for some reason, we have decided
that a 5 % p-value is the gold standard,

00:07:00.860 --> 00:07:03.530
and if you have less than 5 %, then it's a good thing,

00:07:03.530 --> 00:07:05.900
if you have more than 5 %, that's a bad thing,

00:07:05.900 --> 00:07:07.460
so that's an arbitrary threshold.

00:07:07.460 --> 00:07:13.010
So, a paper could have 
hundreds of p-values easily,

00:07:13.010 --> 00:07:17.392
so they are very commonly used in research articles.

00:07:18.569 --> 00:07:20.690
The p-value relates to two different things,

00:07:20.690 --> 00:07:24.080
so it relates to two different errors.

00:07:24.080 --> 00:07:30.590
And we have two things in statistical analysis,

00:07:30.590 --> 00:07:31.640
we have the population,

00:07:31.640 --> 00:07:34.088
and we have the sample.

00:07:34.088 --> 00:07:40.700
We want to make an inference that 
something exists in the population

00:07:40.700 --> 00:07:43.386
using the sample data.

00:07:43.386 --> 00:07:46.258
So we calculate a test statistic,

00:07:46.258 --> 00:07:50.750
and if the test statistic rejects the
null hypothesis in the sample,

00:07:50.750 --> 00:07:54.350
then we conclude that

00:07:54.350 --> 00:07:57.080
the null doesn't hold in the population either.

00:07:57.080 --> 00:08:00.950
But that's not actually always the case.

00:08:00.950 --> 00:08:03.890
When you get a p-value that is small,

00:08:03.890 --> 00:08:08.810
it's also possible that it 
is a false positive finding.

00:08:08.810 --> 00:08:12.860
So p less than 0.05 means that,

00:08:12.860 --> 00:08:14.420
if there was no effect,

00:08:14.420 --> 00:08:19.520
then the probability of getting
the kind of result you just got

00:08:19.520 --> 00:08:21.820
would be 5%.

00:08:21.820 --> 00:08:26.840
So in 1 out of 20 samples from a population,

00:08:26.840 --> 00:08:28.217
you would get a false positive,

00:08:28.217 --> 00:08:31.370
if the null hypothesis
actually holds.

00:08:31.370 --> 00:08:33.980
So it's possible that it's false positive,

00:08:33.980 --> 00:08:38.390
but it's also possible that it's a true positive.

00:08:38.390 --> 00:08:40.370
So the problem is, we don't know.

00:08:40.370 --> 00:08:43.610
We have evidence that it would be unlikely that

00:08:43.610 --> 00:08:47.120
we would get an effect estimate by chance only.

00:08:47.120 --> 00:08:50.600
Then we conclude that maybe 
it wasn't by chance only,

00:08:50.600 --> 00:08:52.100
but we can't know for sure.
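
This "1 in 20" behaviour can be illustrated by simulation: draw many samples from a population where the null hypothesis really holds and count how often the p-value falls below 0.05. The settings below (a z test, n = 50, 2000 trials) are made-up choices for the sketch:

```python
import math
import random

random.seed(1)

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_p(sample):
    """z test of the null hypothesis 'population mean is 0'."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = mean / (sd / math.sqrt(n))
    return 2.0 * (1.0 - phi(abs(z)))

# The null hypothesis truly holds: the population mean is 0
trials = 2000
false_positives = sum(
    1 for _ in range(trials)
    if two_tailed_p([random.gauss(0.0, 1.0) for _ in range(50)]) < 0.05
)
print(false_positives / trials)  # roughly 0.05
```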

00:08:54.300 --> 00:08:56.640
So this is the type 1 error, a false positive,

00:08:56.640 --> 00:08:59.130
then we have type 2 error, 
which is a false negative.

00:08:59.130 --> 00:09:02.280
Let's say that the null hypothesis
does not hold in the population,

00:09:02.280 --> 00:09:07.470
and let's say that women-led companies are 
really more profitable than men-led companies,

00:09:07.470 --> 00:09:13.500
but for some reason, our study 
couldn't find the difference.

00:09:13.500 --> 00:09:14.640
So that would be a false negative.

00:09:14.640 --> 00:09:17.490
And then there is the case where we say that

00:09:17.490 --> 00:09:19.920
we can't reject the null hypothesis,

00:09:19.920 --> 00:09:23.490
we can't reject the claim 
that there's no difference,

00:09:23.490 --> 00:09:25.110
and there really is no difference,

00:09:25.110 --> 00:09:27.390
so that's also a valid finding.

00:09:27.390 --> 00:09:32.065
So we want to be sure that we either
have true positives or true negatives.

00:09:32.582 --> 00:09:38.790
For the probability of false positives
under the null hypothesis,

00:09:38.790 --> 00:09:42.810
we consider 5% or less acceptable.

00:09:42.810 --> 00:09:46.590
So if we say that the p-value is valid,

00:09:46.590 --> 00:09:50.119
then it should behave as expected.

00:09:50.119 --> 00:09:56.088
So it's okay for the p-value to be less than 0.05,

00:09:56.088 --> 00:09:58.415
say, 3% of the time,

00:09:58.415 --> 00:10:02.610
if the null hypothesis holds in the population.

00:10:02.610 --> 00:10:06.630
So if we have a conservative test, that's okay.

00:10:06.630 --> 00:10:10.200
So we want our errors to be on the cautious side.

00:10:10.200 --> 00:10:16.320
But if our p-value was less than 
0.05, let's say 7 % of the time,

00:10:16.320 --> 00:10:19.830
then we would say that the test is too liberal and

00:10:19.830 --> 00:10:22.950
the p-value is not valid
for this particular test,

00:10:22.950 --> 00:10:25.620
because it doesn't follow 
the reference distribution.

00:10:25.620 --> 00:10:29.584
It's important that when the null hypothesis holds,

00:10:29.584 --> 00:10:33.810
our p-values don't indicate support too often.

00:10:34.616 --> 00:10:37.800
So that was the false positive rate;

00:10:37.800 --> 00:10:40.294
then we have another concept called statistical power.

00:10:40.294 --> 00:10:43.890
And statistical power is something that,

00:10:43.890 --> 00:10:49.650
once we have a test statistic whose p-value
doesn't exceed the acceptable false positive rate,

00:10:49.650 --> 00:10:53.160
we want the test to identify an effect,

00:10:53.160 --> 00:10:56.916
when it exists, as frequently as possible.

00:10:57.144 --> 00:10:59.792
Typically we are okay with 80% power,

00:10:59.792 --> 00:11:03.390
but there are studies with way less power.

00:11:03.390 --> 00:11:07.770
So 80% power means that when
there is an effect in the population,

00:11:07.770 --> 00:11:12.300
then in four out of five studies, 
we would actually detect an effect.
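
Power can be illustrated with the same kind of simulation: now the samples come from a population where an effect really exists, and we count how often the test detects it. The effect size of 0.4 and n = 50 below are hypothetical choices that happen to give roughly 80% power:

```python
import math
import random

random.seed(2)

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_p(sample):
    """z test of the null hypothesis 'population mean is 0'."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = mean / (sd / math.sqrt(n))
    return 2.0 * (1.0 - phi(abs(z)))

# The effect truly exists: the population mean is 0.4, not 0
trials = 2000
detections = 0
for _ in range(trials):
    sample = [random.gauss(0.4, 1.0) for _ in range(50)]
    if two_tailed_p(sample) < 0.05:
        detections += 1

power = detections / trials
print(power)
```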

00:11:12.300 --> 00:11:15.360
The question is, which one is more important?

00:11:15.360 --> 00:11:19.710
So we're not okay with more 
than 5 % false-positive rates,

00:11:19.710 --> 00:11:23.430
but we are okay with 20 % false-negative rates,

00:11:23.430 --> 00:11:26.520
because 80 % power means a 20 % false-negative rate.

00:11:26.520 --> 00:11:28.710
Then the reason why,

00:11:29.730 --> 00:11:33.420
we are so much more worried 
about false positives is that,

00:11:33.420 --> 00:11:36.870
positive effects typically have 
some kind of policy implications.

00:11:36.870 --> 00:11:42.480
If we find out that a medicine 
doesn't do us any good,

00:11:42.480 --> 00:11:44.850
then no one is going to take the
medicine, and we continue the research.

00:11:44.850 --> 00:11:48.270
If we find out that the medicine helps people,

00:11:48.270 --> 00:11:50.490
then people will start taking the medicine.

00:11:50.490 --> 00:11:53.490
If it's a false positive finding,

00:11:53.490 --> 00:11:57.750
then people will take a medicine that is useless,

00:11:57.750 --> 00:11:59.610
or could be even harmful to them.

00:11:59.610 --> 00:12:05.370
So false positives have harmful implications much more often than false negatives,

00:12:05.370 --> 00:12:08.970
and that's the reason why we 
want to avoid false positives.

00:12:08.970 --> 00:12:11.850
We have agreed that it's okay to have a 5 % rate,

00:12:11.850 --> 00:12:15.840
hence p is less than 0.05, but not more.

00:12:16.470 --> 00:12:21.300
Of course, in some scenarios, if you
have a really life-critical application,

00:12:21.300 --> 00:12:27.060
then you could be using a p-value 
threshold of 0.001, for example.

00:12:27.060 --> 00:12:31.530
So 0.05 is not the one correct value,

00:12:31.530 --> 00:12:33.600
that's just the convention in many fields.

00:12:33.600 --> 00:12:35.970
Some other fields use smaller values and

00:12:35.970 --> 00:12:38.520
you can use smaller values in
an individual study as well.