WEBVTT 00:00:00.090 --> 00:00:03.120 We will now formalize the previous example a bit more, 00:00:03.120 --> 00:00:10.080 and we will discuss the concept of null hypothesis significance testing, 00:00:10.080 --> 00:00:12.870 or NHST for short. 00:00:12.870 --> 00:00:18.000 The idea of null hypothesis significance testing is that 00:00:18.690 --> 00:00:20.970 we start by defining, 00:00:20.970 --> 00:00:26.130 we have some kind of estimation problem that gives us some kind of estimate, 00:00:26.130 --> 00:00:28.590 and we define two things: 00:00:28.590 --> 00:00:33.510 we define a test statistic and then we define a null hypothesis. 00:00:33.510 --> 00:00:35.700 So we call it the test statistic, 00:00:35.700 --> 00:00:37.410 and we refer to it as T, 00:00:37.410 --> 00:00:45.330 and then we need to have the sampling distribution of T under the null hypothesis. 00:00:45.330 --> 00:00:50.190 The null hypothesis, or H0, is typically 00:00:50.190 --> 00:00:52.950 a hypothesis that there is no effect: 00:00:52.950 --> 00:00:57.570 there is no correlation between CEO gender and profitability, 00:00:57.570 --> 00:01:02.220 or there is no difference between men-led and women-led companies in profitability. 00:01:02.220 --> 00:01:06.630 Then we derive, based on statistical theory, 00:01:06.630 --> 00:01:08.670 a reference distribution, 00:01:08.670 --> 00:01:14.370 that is, how would the test statistic be distributed if there was really no effect? 00:01:14.370 --> 00:01:22.590 Then we compare the test statistic calculated from our sample to that distribution, 00:01:22.590 --> 00:01:27.270 and we can see that, okay, this area here gives us the p-value. 00:01:27.270 --> 00:01:31.440 So it is the probability of obtaining a test statistic at least this extreme 00:01:31.440 --> 00:01:34.500 under the null hypothesis, given our sample size. 00:01:34.500 --> 00:01:38.220 So we compare the observed statistic to the reference distribution to get the p-value.
00:01:38.220 --> 00:01:40.680 So that's the idea of null hypothesis significance testing. 00:01:40.680 --> 00:01:43.890 Typically this is done by a computer for you, 00:01:43.890 --> 00:01:47.070 so you don't have to draw this normal distribution or calculate the area, 00:01:47.070 --> 00:01:50.940 but it's useful to understand what's going on under the hood, 00:01:50.940 --> 00:01:56.070 so you know what kind of problems we face when we do this kind of inference. 00:01:56.070 --> 00:02:01.530 Perhaps the simplest test using null hypothesis significance testing 00:02:01.530 --> 00:02:04.080 is the t test, 00:02:04.080 --> 00:02:10.920 and the idea of a t test is that it assumes that the estimates 00:02:10.920 --> 00:02:13.920 are normally distributed over repeated samples. 00:02:13.920 --> 00:02:17.160 That was the case when we compared two means, 00:02:17.160 --> 00:02:21.420 so the difference of two means is normally distributed 00:02:21.420 --> 00:02:23.022 when the sample size is large enough. 00:02:23.022 --> 00:02:29.190 And then we have the estimate, 00:02:29.190 --> 00:02:34.876 and the test statistic is the estimate divided by the standard error. 00:02:35.289 --> 00:02:36.660 So instead of looking at 00:02:36.660 --> 00:02:40.440 how far the estimate is from the null hypothesis value of zero, 00:02:40.440 --> 00:02:46.200 we look at how far the estimate, divided by the standard error, is from zero. 00:02:47.730 --> 00:02:51.510 And this follows Student's t distribution, 00:02:51.510 --> 00:02:53.040 which looks like a normal distribution 00:02:53.040 --> 00:02:56.700 but is a bit wider in small samples. 00:02:56.700 --> 00:03:02.370 The idea of the t test, or this estimate divided by the standard error, 00:03:02.370 --> 00:03:05.190 is that we standardize the estimate.
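As a concrete sketch of the statistic just described, the code below computes the estimate (a difference of two means) divided by its standard error. The profitability figures and the `welch_t` helper name are made up for illustration; a real analysis would use a statistics package such as scipy.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    # t statistic for a difference of means: estimate / standard error,
    # with Welch's standard error for possibly unequal variances.
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / se

# Made-up profitability figures for women-led and men-led companies.
profit_women = [8.1, 7.4, 9.0, 8.6, 7.9]
profit_men = [7.2, 6.8, 7.5, 7.0, 7.4]
t = welch_t(profit_women, profit_men)
print(round(t, 2))  # about 3.34: more than two standard errors from zero
```

With identical samples the difference of means is zero, so the statistic is zero; the further the estimate sits from the null value relative to its standard error, the larger the statistic.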
00:03:05.190 --> 00:03:13.920 So remember, standardization means subtracting the mean, 00:03:13.920 --> 00:03:17.310 and here we take the mean to be the null hypothesis value, 00:03:17.310 --> 00:03:20.310 so we subtract zero and it doesn't really make a difference, 00:03:20.310 --> 00:03:23.100 and we divide by the standard deviation, 00:03:23.100 --> 00:03:27.030 which in this case is estimated by the standard error. 00:03:27.030 --> 00:03:29.970 So the t statistic tells us 00:03:29.970 --> 00:03:35.550 how far from zero the estimate is, on a standardized metric. 00:03:35.550 --> 00:03:40.800 If it's more than two standard deviations from zero, 00:03:40.800 --> 00:03:42.750 then we conclude that 00:03:42.750 --> 00:03:47.460 that kind of observation would be unlikely to occur by chance only, 00:03:47.460 --> 00:03:54.750 because 95% of the observations fall within plus or minus two standard deviations 00:03:54.750 --> 00:04:00.937 when we have a normally distributed statistic. 00:04:00.937 --> 00:04:03.180 So, we compare this area; 00:04:03.180 --> 00:04:07.410 in practice, it often makes sense to compare both areas here, 00:04:07.410 --> 00:04:09.150 so we calculate this area as well. 00:04:09.150 --> 00:04:10.740 The logic being that 00:04:11.250 --> 00:04:16.140 it would be an important finding if the difference was in the other direction as well. 00:04:16.140 --> 00:04:21.300 And this relates to what is referred to as one- and two-tailed tests. 00:04:21.300 --> 00:04:23.580 So, which area do we compare? 00:04:23.580 --> 00:04:28.820 Normally, if we only compare one end of the normal distribution here, 00:04:28.820 --> 00:04:30.800 this is called a one-tailed test. 00:04:30.800 --> 00:04:34.760 And if we compare both areas, 00:04:34.760 --> 00:04:38.270 so the five percent area here and here together,
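The "95% within plus or minus two standard deviations" rule mentioned above can be checked numerically. A minimal sketch using only the standard library, where `math.erf` yields the standard normal CDF:

```python
import math

def normal_cdf(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Probability mass within plus or minus two standard deviations of the mean.
within_2sd = normal_cdf(2) - normal_cdf(-2)
print(round(within_2sd, 3))  # about 0.954, i.e. roughly 95%
```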
00:04:38.270 --> 00:04:43.190 So this is 2.5% and this is 2.5%, so they sum to 5%, 00:04:43.190 --> 00:04:45.529 then that's called a two-tailed test. 00:04:46.396 --> 00:04:52.400 Normally, when your statistical software gives you a p-value from a t test, 00:04:52.400 --> 00:04:56.210 or some other test that uses something that looks like a normal distribution, 00:04:56.210 --> 00:04:58.280 for example a z test, 00:04:58.280 --> 00:05:01.820 then it is two-tailed, so you compare both ends. 00:05:01.820 --> 00:05:06.320 And it's considered cheating to use the one-tailed test, 00:05:06.320 --> 00:05:09.530 because what the one-tailed test basically does is 00:05:09.530 --> 00:05:16.730 give you a p-value that is exactly half of the p-value of the two-tailed test. 00:05:16.730 --> 00:05:18.680 Because you have two tails here, 00:05:18.680 --> 00:05:21.350 the probability of the observation being in 00:05:21.350 --> 00:05:24.470 either tail is twice the probability of it being in one tail, 00:05:24.470 --> 00:05:28.400 so the probability here is half of what the two-tailed probability would be. 00:05:28.400 --> 00:05:32.090 The problem with one-tailed tests is that 00:05:32.090 --> 00:05:39.770 the standard is to use two tails, and if we observe a p-value in a research paper, 00:05:39.770 --> 00:05:45.410 we assume that it was made using the two-tailed test. 00:05:45.410 --> 00:05:54.350 Sometimes if the p-value is 0.06 and a researcher wants it to be less than 0.05, 00:05:54.350 --> 00:05:56.480 they switch to a one-tailed test, 00:05:56.480 --> 00:05:58.940 which allows them to halve the p-value, 00:05:58.940 --> 00:06:05.900 and they present those as if they came from a two-tailed test. 00:06:05.900 --> 00:06:08.960 That misleads readers; it's unethical.
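The halving relationship between one- and two-tailed p-values is easy to verify. A sketch using a normal (z) approximation rather than the exact t distribution; the observed statistic 1.8 is an arbitrary illustrative value:

```python
import math

def normal_cdf(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

t = 1.8  # an arbitrary illustrative observed test statistic
p_one_tailed = 1 - normal_cdf(t)             # upper tail only
p_two_tailed = 2 * (1 - normal_cdf(abs(t)))  # both tails; the reported default
print(round(p_one_tailed, 3), round(p_two_tailed, 3))  # about 0.036 and 0.072
```

Note how a two-tailed p-value of roughly 0.07 becomes roughly 0.036 when only one tail is counted, which is exactly the temptation described above.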
00:06:10.250 --> 00:06:15.320 There are basically no good reasons ever to use these one-tailed tests, 00:06:15.320 --> 00:06:18.890 because the two-tailed test is more commonly accepted, and also, 00:06:18.890 --> 00:06:22.550 if someone wants to have the one-tailed test instead of a two-tailed test, 00:06:22.550 --> 00:06:25.100 they can just divide your p-values by two, 00:06:25.100 --> 00:06:27.965 and that's the only difference. 00:06:27.965 --> 00:06:32.480 P-values are very commonly used in research papers. 00:06:32.480 --> 00:06:35.690 So you see papers, for example, 00:06:35.690 --> 00:06:37.160 this is from Hekman's paper, 00:06:37.160 --> 00:06:40.940 where you see these p-values behind statistics: 00:06:40.940 --> 00:06:42.740 you see a regression estimate here, 00:06:42.740 --> 00:06:45.290 and then there is a p-value less than 0.01, 00:06:45.290 --> 00:06:47.300 which is statistically significant; 00:06:47.300 --> 00:06:50.630 you see this n.s., which means non-significant, or 00:06:50.630 --> 00:06:53.630 you can see that the p-value is greater than 0.05. 00:06:53.630 --> 00:07:00.860 So for some reason, we have decided that a 5% p-value is the gold standard, 00:07:00.860 --> 00:07:03.530 and if you have less than 5%, then it's a good thing, 00:07:03.530 --> 00:07:05.900 and if you have more than 5%, that's a bad thing, 00:07:05.900 --> 00:07:07.460 so that's an arbitrary threshold. 00:07:07.460 --> 00:07:13.010 A paper could easily have hundreds of p-values, 00:07:13.010 --> 00:07:17.392 so they are very commonly used in research articles. 00:07:18.569 --> 00:07:20.690 The p-value relates to two different things, 00:07:20.690 --> 00:07:24.080 that is, to two different errors. 00:07:24.080 --> 00:07:30.590 And we have two things in statistical analysis: 00:07:30.590 --> 00:07:31.640 we have the population, 00:07:31.640 --> 00:07:34.088 and we have the sample.
00:07:34.088 --> 00:07:40.700 We want to make an inference that something exists in the population 00:07:40.700 --> 00:07:43.386 using the sample data. 00:07:43.386 --> 00:07:46.258 So we calculate a test statistic, 00:07:46.258 --> 00:07:50.750 and if the test statistic rejects the null hypothesis in the sample, 00:07:50.750 --> 00:07:54.350 then we infer that 00:07:54.350 --> 00:07:57.080 the null doesn't hold in the population. 00:07:57.080 --> 00:08:00.950 But that's not actually always the case. 00:08:00.950 --> 00:08:03.890 When you get a p-value that is small, 00:08:03.890 --> 00:08:08.810 it's also possible that it is a false positive finding. 00:08:08.810 --> 00:08:12.860 So p < 0.05 means that 00:08:12.860 --> 00:08:14.420 if there was no effect, 00:08:14.420 --> 00:08:19.520 then the probability of getting a result at least as extreme as the one you just got 00:08:19.520 --> 00:08:21.820 would be 5%. 00:08:21.820 --> 00:08:26.840 So in 1 out of 20 samples from the population, 00:08:26.840 --> 00:08:28.217 you would be getting a false positive, 00:08:28.217 --> 00:08:31.370 if the null hypothesis holds. 00:08:31.370 --> 00:08:33.980 So it's possible that it's a false positive, 00:08:33.980 --> 00:08:38.390 but it's also possible that it's a true positive. 00:08:38.390 --> 00:08:40.370 The problem is, we don't know. 00:08:40.370 --> 00:08:43.610 We have evidence that it would be unlikely that 00:08:43.610 --> 00:08:47.120 we would get that effect estimate by chance only. 00:08:47.120 --> 00:08:50.600 Then we conclude that maybe it wasn't by chance only, 00:08:50.600 --> 00:08:52.100 but we can't know for sure. 00:08:54.300 --> 00:08:56.640 So this is type 1 error, 00:08:56.640 --> 00:08:59.130 and then we have type 2 error, which is a false negative.
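The "1 out of 20" claim can be illustrated by simulation: draw many samples from a population where the null hypothesis holds, test each one, and count how often p falls below 0.05. This sketch uses a z test of a mean against zero (a normal approximation, so the rate lands slightly above 0.05 in small samples); the sample sizes and study counts are arbitrary choices.

```python
import math
import random
import statistics

def normal_cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_tailed_p(sample):
    # z test of the sample mean against zero (normal approximation).
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return 2 * (1 - normal_cdf(abs(statistics.mean(sample) / se)))

random.seed(1)
n_studies, n_obs = 2000, 50
# The null holds: every observation comes from a population with mean 0.
false_positives = sum(
    two_tailed_p([random.gauss(0, 1) for _ in range(n_obs)]) < 0.05
    for _ in range(n_studies)
)
print(false_positives / n_studies)  # close to 0.05: about 1 study in 20
```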
00:08:59.130 --> 00:09:02.280 Let's say that the null hypothesis doesn't hold in the population: 00:09:02.280 --> 00:09:07.470 let's say that women-led companies are really more profitable than men-led companies, 00:09:07.470 --> 00:09:13.500 but for some reason, our study couldn't find the difference. 00:09:13.500 --> 00:09:14.640 That would be a false negative. 00:09:14.640 --> 00:09:17.490 And then there is the case where we say that 00:09:17.490 --> 00:09:19.920 we can't reject the null hypothesis, 00:09:19.920 --> 00:09:23.490 we can't reject the claim that there's no difference, 00:09:23.490 --> 00:09:25.110 and there really is no difference, 00:09:25.110 --> 00:09:27.390 so that's also a valid finding. 00:09:27.390 --> 00:09:32.065 So we want to be sure that we either have true positives or true negatives. 00:09:32.582 --> 00:09:38.790 As for the probability of false positives under the null hypothesis, 00:09:38.790 --> 00:09:42.810 we consider 5% or less acceptable. 00:09:42.810 --> 00:09:46.590 So if we say that the p-value is valid, 00:09:46.590 --> 00:09:50.119 then it should behave as expected. 00:09:50.119 --> 00:09:56.088 So it's okay for the p-value to be less than 0.05 00:09:56.088 --> 00:09:58.415 only, say, 3% of the time 00:09:58.415 --> 00:10:02.610 if the null hypothesis holds in the population. 00:10:02.610 --> 00:10:06.630 Then we have a conservative test, and that's okay, 00:10:06.630 --> 00:10:10.200 because if we make errors, we want to err on the side of caution. 00:10:10.200 --> 00:10:16.320 But if our p-value was less than 0.05, let's say, 7% of the time, 00:10:16.320 --> 00:10:19.830 then we would say that the test is too liberal and 00:10:19.830 --> 00:10:22.950 the p-value is not valid for that particular test, 00:10:22.950 --> 00:10:25.620 because it doesn't follow the reference distribution. 00:10:25.620 --> 00:10:29.584 It's important that when the null hypothesis holds, 00:10:29.584 --> 00:10:33.810 our p-values don't indicate support too often.
00:10:34.616 --> 00:10:37.800 Then we have another concept called statistical power; 00:10:37.800 --> 00:10:40.294 the previous concept was the false positive rate. 00:10:40.294 --> 00:10:43.890 And statistical power means that, 00:10:43.890 --> 00:10:49.650 once we have a test whose false positive rate doesn't exceed the threshold, 00:10:49.650 --> 00:10:53.160 we want the test to identify an effect, 00:10:53.160 --> 00:10:56.916 when it exists, as frequently as possible. 00:10:57.144 --> 00:10:59.792 Typically we are okay with 80% power, 00:10:59.792 --> 00:11:03.390 but there are studies with way less power. 00:11:03.390 --> 00:11:07.770 So 80% power means that when there is an effect in the population, 00:11:07.770 --> 00:11:12.300 then in four out of five studies, we would actually detect the effect. 00:11:12.300 --> 00:11:15.360 The question is, which one is more important? 00:11:15.360 --> 00:11:19.710 So we're not okay with more than a 5% false-positive rate, 00:11:19.710 --> 00:11:23.430 but we are okay with a 20% false-negative rate, 00:11:23.430 --> 00:11:26.520 because 80% power means a 20% false-negative rate. 00:11:26.520 --> 00:11:28.710 The reason why 00:11:29.730 --> 00:11:33.420 we are so much more worried about false positives is that 00:11:33.420 --> 00:11:36.870 positive findings typically have some kind of policy implications. 00:11:36.870 --> 00:11:42.480 If we find out that a medicine doesn't do us any good, 00:11:42.480 --> 00:11:44.850 then no one is going to take the medicine, and we continue to do research. 00:11:44.850 --> 00:11:48.270 If we find out that the medicine helps people, 00:11:48.270 --> 00:11:50.490 then people will start taking the medicine. 00:11:50.490 --> 00:11:53.490 If it's a false positive finding, 00:11:53.490 --> 00:11:57.750 then people will take a medicine that is useless 00:11:57.750 --> 00:11:59.610 or could even be harmful to them.
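Statistical power can be estimated the same way: simulate studies where an effect really exists and count how often the test detects it. The effect size 0.4 and sample size 50 here are arbitrary, chosen so the sketch lands near the 80% power mentioned above; again a z test (normal approximation) is used for simplicity.

```python
import math
import random
import statistics

def normal_cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_tailed_p(sample):
    # z test of the sample mean against zero (normal approximation).
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return 2 * (1 - normal_cdf(abs(statistics.mean(sample) / se)))

random.seed(2)
effect, n_obs, n_studies = 0.4, 50, 1000
# The null is false: the population mean really is 0.4, not 0.
detected = sum(
    two_tailed_p([random.gauss(effect, 1) for _ in range(n_obs)]) < 0.05
    for _ in range(n_studies)
)
print(detected / n_studies)  # roughly 0.8: the effect is found in about 4 of 5 studies
```

Shrinking the effect size or the sample size lowers the detection rate, which is why underpowered studies miss real effects so often.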
00:11:59.610 --> 00:12:05.370 So false positives have policy implications much more often than false negatives, 00:12:05.370 --> 00:12:08.970 and that's the reason why we want to avoid false positives. 00:12:08.970 --> 00:12:11.850 We have agreed that it's okay to have a 5% rate, 00:12:11.850 --> 00:12:15.840 hence p less than 0.05, but not more. 00:12:16.470 --> 00:12:21.300 Of course, in some scenarios, if you have a really life-critical application, 00:12:21.300 --> 00:12:27.060 then you could be using a p-value threshold of 0.001, for example. 00:12:27.060 --> 00:12:31.530 So 0.05 is not the one correct value; 00:12:31.530 --> 00:12:33.600 it's just the convention in many fields. 00:12:33.600 --> 00:12:35.970 Some other fields use smaller values, and 00:12:35.970 --> 00:12:38.520 you can use smaller values in an individual study as well.