There are some controversies related to the use of p-values. P-values are the most common technique we use for statistical inference today, but there are issues with the technique. Some of the issues are fundamental: some people claim that null hypothesis significance testing is an illogical approach, that it doesn't answer the question we want to answer, and that it focuses on the wrong thing. There is also evidence that if we base our publication decisions (which studies get published and which do not) on p-values, that will distort the body of knowledge. Only studies that support the hypothesis get accepted, and therefore there is a bias toward confirmation. Some of these problems are not specific to null hypothesis significance testing, but it's still useful to understand what limitations and common misunderstandings there are about these techniques.

This slide lists six different statements about null hypothesis significance testing. Assume that we have found that p is less than 0.01 in our study. Does it mean that we have disproven the null hypothesis that there is no difference in the populations? Have we found the probability that the null hypothesis is true, so that it is true with 1% probability? Have we proven the experimental hypothesis that there is a difference? Can we deduce the probability of the experimental hypothesis being true? Do we know that, if we reject the null hypothesis, the probability that we are making a wrong decision is small? Or do we have a reliable finding in the sense that if this experiment were repeated, the replication would arrive at the same result?

All of these are false. These are commonly held beliefs, listed in the Strategic Management Journal, and you can see them all over published articles, but they're not true, because the p-value doesn't tell us any of these things, at least not directly. So we have to understand the criticisms of the p-value, and there are three main points to the criticism.

The first is that the p-value doesn't really tell us what we want to know.
When we have our sample data and have calculated something from the sample, we want to know how certain we are that there is an effect in the population. The p-value doesn't tell us that. It tells us the probability of getting an effect at least as large as the one we just got, in the hypothetical scenario that there is no effect in the population. The probability of the null hypothesis being true given these data is not the same as the probability of these data given that the null hypothesis is true. So the p-value doesn't really tell us what we would like to know. But we want it to tell us that so badly that we nevertheless often say a p-value is evidence for the existence of an effect, whereas in reality it is only indirect evidence at best.
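To make this distinction concrete, here is a minimal simulation sketch. All the numbers (the share of studies where the null is actually true, the group size, the effect size when an effect exists) are illustrative assumptions, not from the lecture; the point is only that P(null true | p < 0.05) need not be anywhere near 5%.

```python
# Sketch: simulate many two-group studies; in a (assumed) fraction of them
# the null hypothesis is actually true. Then check what share of the
# "significant" results nevertheless come from true-null studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n = 10_000, 30   # assumed: 10,000 studies, 30 subjects per group
prior_h0 = 0.5              # assumed: half of all studied effects are null
effect = 0.5                # assumed true effect size when the null is false

h0_true = rng.random(n_studies) < prior_h0
p_values = np.empty(n_studies)
for i in range(n_studies):
    mu = 0.0 if h0_true[i] else effect
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(mu, 1.0, n)
    p_values[i] = stats.ttest_ind(a, b).pvalue

significant = p_values < 0.05
# Among significant results, the share where the null was actually true
# is noticeably larger than 5%.
print("P(H0 true | p < 0.05) ≈", h0_true[significant].mean())
```

With these assumed numbers, roughly 10% of the "significant" results come from studies where the null is actually true, about double what a naive reading of p < 0.05 would suggest.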
There's also another angle to the criticism: that p-values are illogical. The idea of the p-value is based on what you could think of as deductive reasoning. The idea is: if the null hypothesis is true, then we should not observe these data; we observe these data; therefore the null hypothesis cannot be true. If these are statements with absolute truth values ("if null hypothesis, then no data; data, therefore no null hypothesis"), the reasoning works well. But the problem is that the null hypothesis statement is probabilistic. We say that if the null hypothesis is true, then observing these data is very unlikely, and the p-value quantifies that likelihood. We just say that it's unlikely to get that kind of observation by chance alone. When we then get such an observation, we cannot conclude that the null hypothesis is very unlikely. That inference is not logically valid. One way to understand why is to substitute meaningful statements for the null hypothesis and the data.

A classic example: if a person is hanged, then the person is dead; a person is not dead, therefore the person was not hanged. Here the null hypothesis is that the person was hanged, the observed consequence is that he is dead, and if he is not dead, then he was not hanged. That works well.

When the statements are probabilistic, it breaks apart. Another classic example: if a person is American, then the person is very unlikely to be a member of Congress, because Congress has a few hundred members and there are hundreds of millions of people in America. So it's very unlikely that a given American is a member of Congress. But if we then observe that a person is a member of Congress, we cannot infer that the person is very unlikely to be American; on the contrary, you have to be American to be a member of Congress. So when we move from true-and-false statements with absolute truth values to probabilistic statements, things break apart.

Then there is the final criticism, which I think is the most important one and the most commonly misunderstood: a small p-value doesn't tell us whether there is an important effect. It only tells us something about the plausibility of the effect being zero in the population. There are many effects that are not zero but are so small that they don't make any difference, and the p-value doesn't tell us whether an effect is meaningfully large. You have to interpret other statistics to understand that. This is a big problem, because quite often you see articles, for example ones applying regression analysis, that conclude an effect is statistically significant and after that don't really interpret whether the effect is large or not. They just say that it's not zero and therefore we have an interesting result. It doesn't work that way: you have to have a meaningfully large effect to have an interesting result, and the p-value, unfortunately, doesn't tell you that.
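As an illustration of this last point, here is a minimal sketch with made-up numbers (not from the lecture) in which a practically negligible difference between two groups is highly statistically significant, simply because the sample is very large.

```python
# Sketch: a trivially small effect becomes "significant" with enough data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000                     # assumed: one million observations per group
a = rng.normal(100.0, 15.0, n)    # control group
b = rng.normal(100.1, 15.0, n)    # "treatment": difference of 0.1, d ≈ 0.007

t, p = stats.ttest_ind(a, b)
print(f"p-value: {p:.2g}")                         # tiny, "significant"
print(f"mean difference: {b.mean() - a.mean():.3f}")  # practically negligible
```

The p-value alone cannot distinguish this from a large, important effect; only an effect size statistic can.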
So p-values have these issues: they are misunderstood, they have some fundamental problems, and the way we use p-values to judge which papers are publishable and which are not is problematic. There have been some efforts to address these issues. One extreme is that some journals are banning null hypothesis significance testing. If your article includes any p-values, don't try to submit it to Basic and Applied Social Psychology, for example: they will send every article that contains p-values back to the authors and tell them to remove the p-values. The Strategic Management Journal is de-emphasizing p-values too. So there is a trend of de-emphasizing or even outright banning p-values. That's one way of addressing the issue.

Another way of addressing the issue relates to the choice of which studies get published. If we only publish studies that have small p-values, then we will inflate the false positive rate and we will also bias the results; why that is the case I will explain in another video. A way to address this problem is to register studies. Before you analyze your data, before you even collect your sample, you write a study plan and submit it to an online repository that tells the readers what you're planning to do. That plan is then reviewed, instead of your paper with the results. When your research plan is what gets reviewed, your study is not judged on whether the p-values are small or not; the review is based on the strength of the design, which is a much more meaningful metric for quality than the p-value. So registered reports are another upcoming trend that you should be aware of.

There are lots of readings about these controversies, and I really like the paper by Nuzzo in Nature about statistical errors. It's a three- or four-page paper that explains the issues I went through in this presentation, and it's well worth the time to read.

Then there's the question: can we do better?
Null hypothesis significance testing, the p-value, and the confidence interval all suffer from the same problem, so can we do better? What alternatives do we have? If we ban null hypothesis significance testing and we ban confidence intervals, then what remains? Not reporting anything about the precision of the estimates would not be a good idea. So can we do better?

Ultimately we want to say something about the truth value of the effect in the population. Instead of just rejecting the null hypothesis when we observe a small p-value, we would like to say how confident we are that the null hypothesis is true in the population. We don't know that based on the p-value. We can know it only if we also know how common true hypotheses are to begin with, that is, the prior distribution of the hypothesis being true in the population.

Let's take an example: clairvoyance. In this hypothetical example we throw two dice and ask a person to guess what the dice will show. They can either guess, or, if they can see the future, they will know what the dice will show and can answer correctly. Let's say the dice come up one and six. If the person answers correctly, we reject the null hypothesis, because the probability of guessing both dice correctly by chance is 1 out of 36, which is less than 0.05. So either we have a false positive (the person was just guessing and got lucky) or a true positive: the person actually knew what the dice were going to be, so he was clairvoyant.

So what do we do? We know, based on the test, that it has a false positive rate of 1 out of 36, since guessing correctly by chance happens about one time in 36. And let's say the test has a 100% success rate for clairvoyant people: if a person can foresee the dice, they always answer correctly. Now suppose one person in a million is clairvoyant; let's say clairvoyance exists but is fantastically rare.
Then we can say that even if we rejected the null hypothesis, the probability that the person is clairvoyant is still only about 1 in 28,000. The reason is that of one million people, 999,999 are not clairvoyant, and about 1 out of 36 of those, roughly 27,778 people, will answer correctly by luck; the rest will not. There is one clairvoyant person, who gets it right, and there are no false negatives. So we compare that one true positive against roughly 27,778 false positives, which gives a probability of about 1 in 27,779.
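The arithmetic of this example can be written out as a short Bayes' rule sketch; the prior and the error rates are the assumed numbers from the example above, not estimates of anything real.

```python
# Sketch: posterior probability of clairvoyance given a correct answer,
# using the assumed numbers from the lecture's example.
prior = 1 / 1_000_000      # assumed: one person in a million is clairvoyant
p_hit_clair = 1.0          # assumed: clairvoyants always answer correctly
p_hit_guess = 1 / 36       # chance of guessing both dice correctly

# Bayes' rule: P(clair | hit) = P(hit | clair) P(clair) / P(hit)
posterior = (prior * p_hit_clair) / (
    prior * p_hit_clair + (1 - prior) * p_hit_guess
)
print(f"P(clairvoyant | correct) = {posterior:.3e}")  # ≈ 3.6e-05, i.e. ~1/27,779
```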
This is the idea of Bayesian statistics: we include prior information, our beliefs about the prior distribution of the phenomenon, in our analysis, and based on that prior information we can say something about the phenomenon we're studying that goes beyond the p-value. The problem, of course, is how we know these priors: how do we know that our test has a 100% success rate, and how do we know that one in a million people are clairvoyant? That's the problem of Bayesian analysis. In Bayesian analysis we add information based on what we know before the study, which allows us to make inferences that differ from p-values, but the problem remains: how would we know?

Bayesian analysis sounds attractive, but it has been available for a long, long time, so this is not a new thing, and it has also been "coming" for a long, long time. For example, there's an article published recently in the Journal of Management saying that the time for Bayesian analysis is now, and there have been articles like that for a long time. This is also something you should keep up with. It could be that Bayesian analyses, where we include prior information in our statistical analysis, become more popular, but nowadays they are not that commonly used. I can't point you to many papers that apply Bayesian logic; I've applied it in one paper myself.

The problem with Bayesian analysis is not only the priors, which you would have to know. It's also that because these methods are not commonly used in business studies, reviewers who get these studies on their desks don't know what to do with them. We don't know how to properly evaluate them, and perhaps we are sceptical of those studies for that reason. So there's a kind of chicken-and-egg problem: this could in some ways be a better approach to statistical analysis, but because we have not been doing it in the past, we're not doing it now.

A more pragmatic way of thinking about this is that you will have to know p-values anyway, because at least 99% of published studies use them. Once you know p-value based statistics, you can start publishing yourself, or you could spend an additional year learning Bayesian analysis. Most people will choose to go publishing, because as a PhD your tenure depends on that.