WEBVTT
Kind: captions
Language: en
00:00:00.090 --> 00:00:03.360
There are some controversies
related to the use of p-values.
00:00:03.360 --> 00:00:09.030
So why are there controversies, when p-values are the most common technique that we use for statistical inference now?
00:00:09.030 --> 00:00:11.981
There are issues with the technique.
00:00:12.147 --> 00:00:14.396
Some of the issues are fundamental,
00:00:14.396 --> 00:00:18.510
like some people claim that the
null hypothesis significance testing
00:00:18.510 --> 00:00:20.499
is an illogical approach.
00:00:20.499 --> 00:00:23.179
It doesn't answer the question we want to answer,
00:00:23.345 --> 00:00:25.925
and it focuses on the wrong thing.
00:00:25.925 --> 00:00:28.920
Also, there is some evidence that
00:00:28.920 --> 00:00:31.650
if we base our publication decisions,
00:00:31.650 --> 00:00:33.640
that is, which studies get published and which are not,
00:00:33.782 --> 00:00:35.222
on the p-values,
00:00:35.222 --> 00:00:37.890
that will distort the body of knowledge.
00:00:37.890 --> 00:00:43.830
So only studies that support the
hypothesis are going to be accepted,
00:00:44.091 --> 00:00:47.031
therefore there is bias towards confirmation.
00:00:47.648 --> 00:00:52.664
Some of these problems are not specific
to null hypothesis significance testing,
00:00:52.972 --> 00:00:55.282
but it's still useful to understand,
00:00:55.282 --> 00:00:58.892
what limitations and what common
misunderstandings there are
00:00:59.059 --> 00:01:00.739
about these techniques.
00:01:01.214 --> 00:01:11.584
So this slide lists six different statements about
00:01:11.584 --> 00:01:13.684
null hypothesis significance testing.
00:01:14.634 --> 00:01:17.064
Assume that we have found
00:01:17.468 --> 00:01:22.401
that p is less than 0.01 in our study.
00:01:23.018 --> 00:01:26.260
Does it mean that we have
disproven the null hypothesis?
00:01:26.664 --> 00:01:28.584
That is, the hypothesis that there is
00:01:28.893 --> 00:01:31.059
no difference in the populations?
00:01:31.296 --> 00:01:35.759
Have we found the probability
that the null hypothesis is true?
00:01:35.759 --> 00:01:39.014
at the one percent probability level?
00:01:39.014 --> 00:01:42.544
Have we proven an experimental
hypothesis that there is a difference?
00:01:43.043 --> 00:01:47.124
Can we deduce the probability of an
experimental hypothesis being true?
00:01:47.907 --> 00:01:51.087
We know that if we rejected the null hypothesis,
00:01:51.087 --> 00:01:55.000
the probability that we are
making a wrong decision is small,
00:01:55.000 --> 00:01:58.200
or we have a reliable finding in the sense that
00:01:58.200 --> 00:02:01.230
if this experiment was repeated then
00:02:01.230 --> 00:02:06.570
the replication
would arrive at the same result.
00:02:07.235 --> 00:02:09.665
All of these are false.
00:02:09.665 --> 00:02:13.950
These are commonly held beliefs listed
in the Strategic Management Journal,
00:02:13.950 --> 00:02:16.560
and you can see these all
over in published articles,
00:02:16.560 --> 00:02:17.764
but they're not true,
00:02:17.764 --> 00:02:20.485
because the p-value doesn't tell us
00:02:20.937 --> 00:02:24.173
any of these things, at least not directly.
00:02:24.173 --> 00:02:27.840
So we have to understand the
criticisms of the p-value.
00:02:27.840 --> 00:02:31.740
And there are three main points in the criticism.
00:02:32.737 --> 00:02:35.910
One is that the p-value doesn't really tell us
00:02:35.910 --> 00:02:37.110
what we want to know.
00:02:37.822 --> 00:02:39.302
So we want to know,
00:02:39.302 --> 00:02:41.334
when we have our sample data,
00:02:41.429 --> 00:02:43.917
we have calculated something from the sample,
00:02:44.036 --> 00:02:46.237
we want to know how certain we can be
00:02:46.237 --> 00:02:48.847
that there is an effect in the population.
00:02:48.847 --> 00:02:51.780
P-value doesn't tell us that,
00:02:51.780 --> 00:02:57.556
it tells us, what is the probability
of getting the effect that we just got
00:02:57.770 --> 00:03:00.350
in the hypothetical scenario,
00:03:00.350 --> 00:03:03.152
that there is no effect in the population.
00:03:03.152 --> 00:03:05.879
So the probability of the null hypothesis
00:03:05.879 --> 00:03:09.392
being true, given these data, is not the same as
00:03:09.392 --> 00:03:11.565
the probability of these data,
00:03:11.565 --> 00:03:13.530
given that the null hypothesis is true.
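The distinction between these two conditional probabilities can be made concrete with Bayes' rule. Here is a minimal sketch in Python; all numbers are hypothetical, chosen only to illustrate how different P(data | H0) and P(H0 | data) can be.

```python
# Hypothetical numbers for illustration: suppose most studied effects are null,
# and we observe data that would be "significant" under the null hypothesis.
p_h0 = 0.99        # assumed prior probability that the null hypothesis is true
p_data_h0 = 0.01   # P(data | H0): the quantity a p-value resembles
p_data_h1 = 0.80   # assumed P(data | H1): such data are likely when a real effect exists

# Bayes' rule: P(H0 | data) = P(data | H0) * P(H0) / P(data)
p_data = p_data_h0 * p_h0 + p_data_h1 * (1 - p_h0)
p_h0_given_data = p_data_h0 * p_h0 / p_data

print(p_data_h0)                  # 0.01: the data look unlikely under H0
print(round(p_h0_given_data, 2))  # 0.55: yet H0 is still more likely true than not
```

With these assumed numbers the data look unlikely under the null, yet the null remains the more probable explanation, because true effects were rare to begin with.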
00:03:14.124 --> 00:03:16.586
So p-value doesn't really tell us,
00:03:16.586 --> 00:03:18.427
what we would like to know.
00:03:18.427 --> 00:03:20.670
But we so much wanted it to tell us that,
00:03:20.670 --> 00:03:22.970
that we nevertheless often say that
00:03:22.970 --> 00:03:28.110
a p-value is evidence for
the existence of an effect,
00:03:28.110 --> 00:03:32.024
whereas in reality, it is only
indirect evidence at best.
00:03:33.455 --> 00:03:35.522
There's also another criticism,
00:03:35.522 --> 00:03:38.190
another angle to the criticism.
00:03:38.190 --> 00:03:40.994
It is that the p-values are illogical.
00:03:41.920 --> 00:03:45.799
So the idea of the p-value is based on
00:03:47.698 --> 00:03:50.331
what you could think of as deductive reasoning.
00:03:50.331 --> 00:03:51.915
So the idea here is that
00:03:51.915 --> 00:03:55.385
if the null hypothesis is true,
00:03:55.741 --> 00:03:58.370
then we should not observe a certain kind of data.
00:03:58.892 --> 00:04:01.772
We observe that kind of data,
00:04:01.772 --> 00:04:05.326
therefore the null hypothesis can't be true.
00:04:05.326 --> 00:04:09.270
If these are statements
00:04:09.270 --> 00:04:11.010
that have absolute truth values,
00:04:11.010 --> 00:04:16.410
so if the null hypothesis, then no data;
00:04:16.410 --> 00:04:20.880
if data, therefore no null hypothesis,
00:04:21.141 --> 00:04:22.161
it works well,
00:04:22.161 --> 00:04:24.835
but the problem is that
00:04:25.405 --> 00:04:28.370
the null hypothesis is a probabilistic statement.
00:04:28.584 --> 00:04:32.002
So we say that if the null hypothesis is true,
00:04:32.002 --> 00:04:35.177
then observing these data is very unlikely.
00:04:35.177 --> 00:04:38.170
So the p-value quantifies that likelihood.
00:04:38.170 --> 00:04:43.150
So we just say that it's unlikely to have
that kind of observations by chance only.
00:04:43.150 --> 00:04:45.000
When we get such an observation,
00:04:45.000 --> 00:04:50.740
we still cannot conclude that the null
hypothesis is very unlikely.
00:04:51.370 --> 00:04:52.844
This is not logically valid.
00:04:53.723 --> 00:04:55.202
One way to understand,
00:04:55.202 --> 00:04:56.470
why this is not logically valid,
00:04:56.470 --> 00:05:00.970
is to put some meaningful statements in place of the null hypothesis
00:05:00.970 --> 00:05:03.049
and of the data.
00:05:03.452 --> 00:05:05.762
So this one could be for example,
00:05:05.952 --> 00:05:08.592
a classic example is that,
00:05:09.439 --> 00:05:11.514
if a person is hanged,
00:05:11.514 --> 00:05:12.774
then the person is dead,
00:05:13.367 --> 00:05:16.682
a person is not dead, therefore
the person was not hanged.
00:05:16.991 --> 00:05:18.601
So the null hypothesis is,
00:05:18.601 --> 00:05:22.000
that the person is hanged, the observed
consequence is that he's dead,
00:05:22.000 --> 00:05:25.510
then if he's not dead, then he was not hanged.
00:05:25.771 --> 00:05:26.701
It works well.
00:05:27.960 --> 00:05:30.340
When probabilistic statements are made,
00:05:30.340 --> 00:05:31.600
it breaks apart.
00:05:31.600 --> 00:05:34.960
So a classic example is that,
00:05:34.960 --> 00:05:36.940
if a person is American,
00:05:36.940 --> 00:05:40.300
then the person is very unlikely
to be a member of Congress,
00:05:41.260 --> 00:05:44.170
because Congress has only some hundreds of members,
00:05:44.170 --> 00:05:46.870
and there are hundreds of
million people in America.
00:05:46.870 --> 00:05:50.410
So it's very unlikely that an
American is a member of Congress.
00:05:51.146 --> 00:05:54.670
Then if we observe that the person
is a member of Congress,
00:05:54.670 --> 00:06:00.130
we cannot make the inference that it is very unlikely that the person is an American,
00:06:00.130 --> 00:06:03.310
because you have to be American
to be a member of Congress.
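The Congress example can be written out as arithmetic. The population and Congress figures below are rough and assumed only for illustration.

```python
# Rough, assumed figures for the lecture's Congress example.
us_population = 330_000_000  # Americans (approximate)
congress_size = 535          # members of Congress (House plus Senate)

# "If a person is American, they are very unlikely to be in Congress":
p_congress_given_american = congress_size / us_population  # about 1.6 in a million

# But the reverse conditional is completely different:
# every member of Congress is American, so
p_american_given_congress = 1.0

print(p_congress_given_american < 0.00001)  # True: the forward probability is tiny
print(p_american_given_congress)            # 1.0: the reverse probability is certain
```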
00:06:03.737 --> 00:06:06.727
So when we move to probabilistic statements,
00:06:06.727 --> 00:06:09.275
from these true and false statements,
00:06:09.275 --> 00:06:11.581
which have absolute truth values,
00:06:11.581 --> 00:06:13.872
then things break apart.
00:06:14.727 --> 00:06:16.570
Then the final criticism,
00:06:16.570 --> 00:06:19.037
which I think is the most important one,
00:06:19.037 --> 00:06:21.677
and the most commonly misunderstood:
00:06:21.677 --> 00:06:26.050
it is that a small p-value doesn't tell us
00:06:26.050 --> 00:06:28.450
whether there is an important effect.
00:06:28.450 --> 00:06:35.980
It only tells us something about the plausibility of the effect being zero in the population.
00:06:36.692 --> 00:06:39.940
There are many effects
00:06:39.940 --> 00:06:44.080
that are not zero but are so small
that they don't make any difference.
00:06:44.436 --> 00:06:46.416
And p-value doesn't tell us,
00:06:46.416 --> 00:06:49.452
whether an effect is meaningfully large.
00:06:49.452 --> 00:06:53.290
You have to interpret other
statistics to understand that.
00:06:53.290 --> 00:06:57.566
And this is a big problem because
quite often you see articles,
00:06:57.780 --> 00:07:00.360
for example, applying regression analysis,
00:07:00.360 --> 00:07:03.445
they conclude that an effect
is statistically significant,
00:07:03.445 --> 00:07:06.910
and after that, they don't really interpret
00:07:06.910 --> 00:07:08.470
whether the effect is large or not.
00:07:08.470 --> 00:07:10.600
They just say that it's not zero,
00:07:10.600 --> 00:07:13.150
therefore we have an interesting result.
00:07:13.482 --> 00:07:15.387
It doesn't work that way.
00:07:15.387 --> 00:07:17.200
You have to have a meaningfully large effect
00:07:17.200 --> 00:07:18.959
to have an interesting result.
00:07:18.959 --> 00:07:21.850
And p-value doesn't tell you that, unfortunately.
00:07:22.966 --> 00:07:24.880
So p-values have these issues,
00:07:24.880 --> 00:07:28.006
they are misunderstood and
00:07:28.006 --> 00:07:30.850
they have some fundamental issues and also
00:07:30.850 --> 00:07:32.830
the way we use p-values
00:07:32.830 --> 00:07:35.770
to judge which papers are
publishable and which are not,
00:07:35.770 --> 00:07:37.070
is problematic.
00:07:37.521 --> 00:07:40.360
So there have been some efforts
to address these issues.
00:07:40.360 --> 00:07:43.120
One extreme is that
00:07:43.120 --> 00:07:47.200
some journals are banning
null hypothesis significance testing.
00:07:47.770 --> 00:07:51.671
So if your article includes any p-values,
00:07:52.170 --> 00:07:58.402
don't try to submit it to Basic and
Applied Social Psychology, for example.
00:07:58.402 --> 00:08:03.910
They will send every article that
has p-values back to the author,
00:08:03.910 --> 00:08:05.860
and tell the authors to remove the p-values.
00:08:06.549 --> 00:08:10.600
Strategic Management Journal
is de-emphasizing p-values too.
00:08:10.600 --> 00:08:14.980
So there is a trend of de-emphasizing
or even outright banning p-values.
00:08:14.980 --> 00:08:16.870
So that's one way of addressing the issue.
00:08:18.105 --> 00:08:24.700
Another way of addressing the
issue is related to the choice of
00:08:24.700 --> 00:08:26.141
which studies are published.
00:08:26.545 --> 00:08:28.735
So if we only publish studies
00:08:28.735 --> 00:08:31.675
that have small p-values,
00:08:31.841 --> 00:08:35.681
then we will inflate the false-positive rate,
00:08:35.752 --> 00:08:38.092
and we will also bias the results.
00:08:38.140 --> 00:08:41.230
Why that's the case I will
explain in another video.
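Although the full explanation is left to another video, a small simulation can illustrate the bias: if only studies with small p-values get published, the published effect estimates overstate the true effect. This sketch uses made-up parameters (a true effect of 0.2, samples of 30) and a rough t critical value.

```python
# Simulated publication bias (illustrative parameters, not from the lecture).
import numpy as np

rng = np.random.default_rng(0)
true_effect, n, runs = 0.2, 30, 20_000

published = []
for _ in range(runs):
    sample = rng.normal(true_effect, 1.0, n)
    # One-sample t statistic against a null mean of zero.
    t = sample.mean() / (sample.std(ddof=1) / np.sqrt(n))
    if t > 2.045:  # roughly the 5% two-sided critical value for 29 degrees of freedom
        published.append(sample.mean())  # "publish" only significant results

# The average published effect is well above the true effect of 0.2.
print(round(float(np.mean(published)), 2))
```

Only the lucky overestimates clear the significance bar, so the published literature exaggerates the effect.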
00:08:41.657 --> 00:08:45.010
But a way to address this problem is
00:08:45.010 --> 00:08:47.350
to register studies in advance.
00:08:47.350 --> 00:08:49.600
So before you analyze your data,
00:08:49.600 --> 00:08:51.880
before you collect your sample,
00:08:51.880 --> 00:08:53.620
you write a study plan,
00:08:53.620 --> 00:08:56.530
and then you submit it to
an online repository that
00:08:58.630 --> 00:09:00.760
tells the readers what you're planning to do.
00:09:01.164 --> 00:09:02.934
And that will be reviewed,
00:09:02.934 --> 00:09:05.940
instead of your paper with the results.
00:09:05.964 --> 00:09:10.044
When your research
plan is being reviewed,
00:09:10.044 --> 00:09:11.770
that means that
00:09:11.770 --> 00:09:14.590
your study is not being judged based on
00:09:14.590 --> 00:09:17.200
whether the p-values are small or not,
00:09:17.200 --> 00:09:20.890
but instead the review is
based on the strength of the design,
00:09:20.890 --> 00:09:25.210
which is a much more meaningful metric for quality
00:09:25.210 --> 00:09:26.830
than the p-value.
00:09:27.186 --> 00:09:29.766
So registered reports are another
00:09:29.766 --> 00:09:33.700
upcoming trend that you should be aware of.
00:09:34.365 --> 00:09:38.830
There are lots of readings about controversies,
00:09:38.830 --> 00:09:45.910
and I really like the paper by Nuzzo
in Nature about statistical errors.
00:09:45.910 --> 00:09:47.410
It's a three or four-page paper
00:09:47.410 --> 00:09:51.010
that explains these issues that I
went through in this presentation,
00:09:51.010 --> 00:09:54.700
and it's well worth reading.
00:09:56.386 --> 00:09:59.680
Then there's the question of, can we do better?
00:09:59.680 --> 00:10:07.120
So the null hypothesis significance testing, the p-value and the confidence interval,
00:10:07.120 --> 00:10:08.770
they suffer from the same problem,
00:10:08.770 --> 00:10:09.670
so can we do better?
00:10:09.670 --> 00:10:12.425
So what alternatives do we have?
00:10:12.425 --> 00:10:15.070
So if we ban null hypothesis significance testing,
00:10:15.070 --> 00:10:17.179
we ban confidence intervals,
00:10:17.179 --> 00:10:19.000
then what's remaining?
00:10:19.000 --> 00:10:22.780
Not reporting anything about
the precision of the estimates,
00:10:22.780 --> 00:10:24.850
that would not be a good idea.
00:10:24.850 --> 00:10:27.100
So can we do better?
00:10:27.100 --> 00:10:29.260
So ultimately we want to know,
00:10:29.260 --> 00:10:33.247
or say something about the truth
value of the effect in the population.
00:10:34.126 --> 00:10:37.120
So instead of just reporting the p-value
00:10:37.120 --> 00:10:40.360
when we reject a null hypothesis,
00:10:41.215 --> 00:10:42.490
we would like to say,
00:10:42.490 --> 00:10:47.890
how confident we are that the null
hypothesis is true in the population.
00:10:48.650 --> 00:10:51.560
But we don't know that based on the p-value.
00:10:51.560 --> 00:10:54.152
We can know that
00:10:54.152 --> 00:11:00.940
if we know the distribution of
true hypotheses in the population.
00:11:01.652 --> 00:11:03.482
So let's take an example.
00:11:03.482 --> 00:11:05.669
Let's take an example of clairvoyance.
00:11:06.951 --> 00:11:10.339
So we have a hypothetical example,
00:11:10.339 --> 00:11:12.940
where we throw two dice,
00:11:12.940 --> 00:11:16.467
and we ask a person to guess the two dice.
00:11:17.156 --> 00:11:19.264
They can either guess the two dice,
00:11:19.264 --> 00:11:20.690
or if they can see the future,
00:11:20.690 --> 00:11:22.730
they will know what the two dice are,
00:11:22.730 --> 00:11:24.665
and they can answer correctly.
00:11:25.425 --> 00:11:28.100
So if a person answers the question of two dice,
00:11:28.100 --> 00:11:30.196
let's say that they are a one and a six,
00:11:30.196 --> 00:11:32.684
if the person answers the question correctly,
00:11:33.064 --> 00:11:35.060
then we reject the null hypothesis,
00:11:35.060 --> 00:11:39.455
because the probability of guessing two
dice correctly is one out of 36,
00:11:39.455 --> 00:11:44.087
which is less than 0.05.
00:11:44.206 --> 00:11:46.697
So either we have a false positive,
00:11:46.697 --> 00:11:51.320
meaning that the person was
actually guessing and got lucky,
00:11:51.320 --> 00:11:53.960
or it's a true positive that
the person actually knew
00:11:53.960 --> 00:11:55.040
what the dice were going to be,
00:11:55.040 --> 00:11:56.697
so he was clairvoyant.
00:11:58.027 --> 00:12:00.155
So what do we do?
00:12:00.155 --> 00:12:03.410
We know, based on the test,
00:12:03.410 --> 00:12:08.120
that it has a 1 out of 36 false-positive rate,
00:12:08.120 --> 00:12:11.225
because guessing correctly happens about one time out of 36.
00:12:11.605 --> 00:12:18.320
And let's say that this test has a 100
% success rate for clairvoyant people.
00:12:18.747 --> 00:12:23.630
So if the person can foresee the
dice, they must answer correctly.
00:12:26.147 --> 00:12:31.040
If there are one in a million
people that are clairvoyant,
00:12:31.040 --> 00:12:34.862
let's say that clairvoyance exists
but is fantastically rare.
00:12:34.862 --> 00:12:36.230
Then we can say that,
00:12:36.230 --> 00:12:39.380
if we rejected the null hypothesis,
00:12:39.380 --> 00:12:45.740
the probability of the person being
clairvoyant is still 1 out of about 28 000.
00:12:45.740 --> 00:12:50.600
The reason being that of these people
00:12:50.600 --> 00:12:54.380
who are not clairvoyant,
00:12:54.380 --> 00:12:59.564
999 999 people in total,
00:13:00.846 --> 00:13:06.350
1 out of 36 pass the test, which is about 27 778,
00:13:06.730 --> 00:13:09.190
and the remaining fail.
00:13:09.190 --> 00:13:11.427
There is one clairvoyant person,
00:13:11.427 --> 00:13:12.920
who gets it correctly
00:13:12.920 --> 00:13:14.900
and there are no false negatives.
00:13:14.900 --> 00:13:22.140
So we compare one against about 27 778.
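The clairvoyance numbers can be checked with a short calculation, using the lecture's assumptions (one clairvoyant per million, a 1-in-36 chance of passing by guessing, and a test that never misses a clairvoyant):

```python
# The lecture's hypothetical clairvoyance test, as a calculation.
population = 1_000_000
clairvoyants = 1              # prior: one person in a million is clairvoyant
false_positive_rate = 1 / 36  # chance of guessing both dice correctly
true_positive_rate = 1.0      # a clairvoyant always passes the test

true_positives = clairvoyants * true_positive_rate                   # 1 person
false_positives = (population - clairvoyants) * false_positive_rate  # ~27 778 people

# P(clairvoyant | passed the test): compare the one true positive
# against the roughly 27 778 lucky guessers.
posterior = true_positives / (true_positives + false_positives)
print(round(1 / posterior))  # 27779, i.e. about 1 in 28 000
```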
00:13:22.140 --> 00:13:25.230
So this is the idea of Bayesian statistics.
00:13:25.230 --> 00:13:27.870
So we include prior information,
00:13:27.870 --> 00:13:31.230
our beliefs about the prior
distribution of the phenomenon
00:13:31.230 --> 00:13:32.220
into our analysis,
00:13:32.220 --> 00:13:36.060
and then, based
on that prior information,
00:13:36.060 --> 00:13:39.812
we can say something about the
phenomena that we're studying,
00:13:39.812 --> 00:13:43.417
that goes beyond the p-value.
00:13:43.417 --> 00:13:47.000
The problem, of course, is this:
00:13:47.000 --> 00:13:49.245
how do we know these priors,
00:13:49.245 --> 00:13:53.561
how do we know that our
test has a 100% success rate,
00:13:53.561 --> 00:13:58.170
how do we know that there are one in
a million people that are clairvoyant?
00:13:58.526 --> 00:14:01.166
So that's the problem of Bayesian analysis.
00:14:01.380 --> 00:14:02.670
In Bayesian analysis,
00:14:02.670 --> 00:14:05.670
we add information to the analysis based on,
00:14:05.670 --> 00:14:07.389
what we know before the study,
00:14:07.650 --> 00:14:11.730
and that allows us to make inferences
that are slightly different from p-values,
00:14:12.466 --> 00:14:15.046
but the problem remains: how would we know?
00:14:17.088 --> 00:14:19.520
Bayesian analysis sounds attractive,
00:14:19.520 --> 00:14:23.910
but it has been available for a long long time,
00:14:23.910 --> 00:14:25.736
so this is not a new thing.
00:14:25.736 --> 00:14:29.940
And people have been saying that Bayesian
analysis is coming for a long, long time.
00:14:29.940 --> 00:14:36.611
For example, there's an article that was
published recently in the Journal of Management,
00:14:36.611 --> 00:14:41.635
saying that the time for Bayesian analysis is now.
00:14:42.015 --> 00:14:45.000
There were articles
like this a long time ago, too.
00:14:45.000 --> 00:14:51.900
This is also something that
you should be keeping up with.
00:14:51.900 --> 00:14:55.530
So it could be that these Bayesian analyses,
00:14:55.530 --> 00:15:00.990
where we include prior information
in our statistical analysis,
00:15:00.990 --> 00:15:03.210
will become more popular.
00:15:03.210 --> 00:15:06.270
But nowadays it's not that commonly used.
00:15:06.270 --> 00:15:07.830
I can't point you to many papers
00:15:07.830 --> 00:15:09.990
that have applied Bayesian logic;
00:15:09.990 --> 00:15:11.896
I've applied it in one paper myself.
00:15:12.086 --> 00:15:15.210
The problem with Bayesian
analysis is not only the priors,
00:15:15.210 --> 00:15:16.740
which you would have to know,
00:15:16.740 --> 00:15:24.089
it's also that, because these methods are not
commonly used in business studies,
00:15:24.089 --> 00:15:27.930
reviewers who get these studies on their desks
00:15:27.930 --> 00:15:29.070
don't know what to do with them.
00:15:29.569 --> 00:15:31.609
So we don't know how to properly evaluate them,
00:15:31.609 --> 00:15:34.530
and perhaps we are sceptical of
those studies for that reason.
00:15:35.124 --> 00:15:37.684
So there's kind of like a chicken-and-egg problem.
00:15:37.684 --> 00:15:42.220
This could be in some ways a better
approach to doing statistical analysis,
00:15:42.220 --> 00:15:46.364
but because we have not been doing it in the past,
00:15:46.554 --> 00:15:48.264
we're not doing it now.
00:15:48.335 --> 00:15:50.525
There is also a pragmatic
00:15:50.525 --> 00:15:52.750
way of thinking about this:
00:15:52.750 --> 00:15:54.430
you will have to know p-values anyway,
00:15:54.430 --> 00:15:58.270
because at least 99% of
00:15:58.270 --> 00:16:00.751
the studies that are published use p-values.
00:16:00.751 --> 00:16:03.415
Once you know p-value based statistics,
00:16:03.510 --> 00:16:06.180
then you can start publishing yourself.
00:16:06.180 --> 00:16:10.270
Or you could spend an additional
year learning Bayesian analysis.
00:16:10.270 --> 00:16:12.820
Most people will choose to go publishing,
00:16:12.820 --> 00:16:16.930
because you are a PhD and
your tenure depends on publishing.