WEBVTT
00:00:00.060 --> 00:00:08.940
Now we start talking about statistical inference.
Which refers to the task of making some kind of
00:00:08.940 --> 00:00:15.540
statistical claims about the population.
We don't yet discuss causality or causal
00:00:15.540 --> 00:00:21.120
claims but just making claims that there
is an association between two variables
00:00:21.120 --> 00:00:26.280
or difference between two groups in
a population based on sample data.
00:00:26.280 --> 00:00:33.210
And our example now comes from the Talouselämä
500 magazine, which I covered in a previous
00:00:33.210 --> 00:00:40.410
video. This is a Finnish business magazine
that follows the 500 largest Finnish companies.
00:00:40.410 --> 00:00:47.820
And in one particular year in 2005
there were big headlines in Finnish
00:00:47.820 --> 00:00:55.890
newspapers because on this list the average
return on assets of women-led companies was
00:00:55.890 --> 00:01:05.160
4.7% points higher than in men-led companies.
So our question now is: we have an observation,
00:01:05.160 --> 00:01:11.910
the return on assets difference of 4.7%
points, and is it a big deal? Does it matter?
00:01:13.140 --> 00:01:19.020
What does the data tell us and what kind
of inferences can we make from this sample?
00:01:19.020 --> 00:01:29.970
4.7% points is a pretty big difference. So what?
What does it mean? What the data are, and what
00:01:29.970 --> 00:01:40.230
the data tell us directly, is that at one point in
time, in one sample, the firms led by women
00:01:40.230 --> 00:01:48.960
are more profitable. That's what the data tells
us and now the question is can we generalize.
00:01:48.960 --> 00:01:57.300
Can we say something beyond that particular
sample? Can we say that this generalizes to
00:01:57.300 --> 00:02:03.660
other years or is it just one year? If it's just
one year, and the women-led companies happened to be
00:02:03.660 --> 00:02:08.100
more profitable but it wouldn't generalize
to other years, then it's not a big deal.
00:02:08.100 --> 00:02:16.320
If it generalizes to other years then it's
probably a big deal. The second question is does
00:02:16.320 --> 00:02:22.020
it generalize to other firms? Is it just these
500 companies in which the women-led companies
00:02:22.020 --> 00:02:30.360
are more profitable or does it generalize to the
thousand largest companies or all companies in
00:02:30.360 --> 00:02:37.890
Finland or all companies in all countries? How
do we generalize, how widely can we generalize?
00:02:37.890 --> 00:02:44.910
There is a first question that we need to ask
when we start discussing generalizability
00:02:44.910 --> 00:02:48.210
of a sample statistic.
This is a sample statistic.
00:02:48.870 --> 00:02:55.020
It's a number calculated from a sample.
Does it generalize to the population? We have to ask
00:02:55.020 --> 00:03:05.520
could this be by chance only? Is it possible that
because of sampling variation the companies
00:03:05.520 --> 00:03:12.720
that were led by women just happened
to have a better year than companies that were
00:03:12.720 --> 00:03:20.160
led by men? Could it be just a random occurrence
or is it evidence of a systematic difference?
00:03:20.160 --> 00:03:27.960
And we have to ask two important questions to
answer that, whether it can be by chance only.
00:03:27.960 --> 00:03:37.080
The first is: is 4.7% points a large difference?
Large differences rarely occur by chance only;
00:03:37.080 --> 00:03:44.550
small differences frequently occur by chance only.
When we calculate something from a sample, the
When we calculate something from a sample the
00:03:44.550 --> 00:03:49.890
sample estimate is hardly ever exactly the
population value. It's somewhere close.
00:03:49.890 --> 00:03:55.350
So is it far enough to say that
it's improbable that this kind
00:03:55.350 --> 00:04:00.840
of result could occur by chance only?
Or is it close enough to the population
00:04:00.840 --> 00:04:09.510
value that it actually makes no difference?
Then we have to look at whether it is a large effect.
00:04:09.510 --> 00:04:18.210
The mean ROA is about 10% in this sample. And a 4.7%
point difference would mean that if the men-led
00:04:18.210 --> 00:04:26.010
companies are at, let's say, 8% ROA, then women-led
companies are at about 13% ROA. So they are more than
00:04:26.010 --> 00:04:33.150
50% more profitable than men-led companies,
that's a big thing. That's a big difference.
00:04:33.150 --> 00:04:40.470
The second important question relates to
sample size. We know that the full sample is
00:04:40.470 --> 00:04:46.200
500 companies but that's not the full story. We
also have to consider how many women there are.
00:04:46.200 --> 00:04:54.690
If there are just five women or if there are
250 women, those two conditions would lead
00:04:54.690 --> 00:05:01.320
to very different conclusions. It happens to
be that there were 22 women in the sample so
00:05:01.320 --> 00:05:11.160
that's a fairly small number of observations.
Now the question of statistical inference.
00:05:11.160 --> 00:05:20.520
We want to see whether this
return on assets difference of 4.7% points is large enough
00:05:20.520 --> 00:05:25.050
that we can conclude that there probably
is a difference, a systematic difference.
00:05:25.050 --> 00:05:31.530
And this is not due to sampling fluctuations
only. We have to ask the question: what would
00:05:31.530 --> 00:05:36.150
be the probability of getting this
kind of difference by chance only?
00:05:36.150 --> 00:05:42.900
You watched the video about John Rauser.
What would John Rauser do in this scenario?
00:05:42.900 --> 00:05:51.870
We have 500 companies, and we want to know whether the
difference between the women-led companies and the
00:05:51.870 --> 00:05:59.970
men-led companies could occur by chance only.
One strategy for answering that
00:05:59.970 --> 00:06:08.100
question is to do a permutation analysis or a
permutation test which is a fairly intuitive way
00:06:08.100 --> 00:06:15.480
of understanding statistical testing. And what
we do is that we take the list of companies.
00:06:15.480 --> 00:06:21.960
And we have the largest companies. I got
the data from a database, so this may not be
00:06:21.960 --> 00:06:27.630
the exact same 500 companies, but it doesn't
matter for the example. We choose 22 companies
00:06:27.630 --> 00:06:33.840
at random, we compare them with the remaining
478, and we calculate the difference.
00:06:33.840 --> 00:06:39.120
So we have 22 companies again: the
mean of 22 companies compared to the
00:06:39.120 --> 00:06:45.870
mean of 478 companies. We repeat this 10,000
times and we see what the differences are.
00:06:45.870 --> 00:06:52.680
What is the probability of getting at least
4.7% point difference in these comparisons?
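A minimal sketch of the permutation test just described, in Python. The ROA values are synthetic (500 random draws with a mean of 10 and a made-up standard deviation of 12), not the data used in the video, and the function name permutation_p_value is an illustrative assumption. It repeatedly picks 22 companies at random, compares their mean with the mean of the remaining 478, and counts how often the difference is at least 4.7.

```python
import random


def permutation_p_value(values, group_size, observed_diff, n_reps=10_000, seed=0):
    """Share of random splits whose mean difference is at least observed_diff."""
    rng = random.Random(seed)
    total = sum(values)
    n_rest = len(values) - group_size
    hits = 0
    for _ in range(n_reps):
        # Pick group_size companies at random, without replacement.
        picked_sum = sum(rng.sample(values, group_size))
        # Mean of the picked group minus mean of everyone else.
        diff = picked_sum / group_size - (total - picked_sum) / n_rest
        if diff >= observed_diff:
            hits += 1
    return hits / n_reps


# Synthetic ROA figures, NOT the real magazine data (mean 10, hypothetical sd 12).
data_rng = random.Random(42)
roa = [data_rng.gauss(10, 12) for _ in range(500)]

p = permutation_p_value(roa, group_size=22, observed_diff=4.7)
print(p)
```

Because the input is synthetic, the printed share will not match the figures reported in the video; only the procedure is the same.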
00:06:52.680 --> 00:06:59.430
So let's take a look at the results. I did the
analysis, here are the first 200 comparisons.
00:06:59.430 --> 00:07:07.740
We can see that quite often when we randomly take
22 companies and compare them against the 478 remaining
00:07:07.740 --> 00:07:13.500
companies the difference is very close to zero.
So here is very close to zero, no difference.
00:07:13.500 --> 00:07:17.970
Sometimes we get a negative difference
here. There is actually
00:07:17.970 --> 00:07:23.520
no systematic difference, and there cannot
be, because I chose companies randomly and two
00:07:23.520 --> 00:07:32.250
random samples are comparable on average. But
we still get differences larger than 4.7:
00:07:32.250 --> 00:07:40.620
we get them in 9 of 200 comparisons using this permutation
testing strategy. So the probability of getting
00:07:40.620 --> 00:07:51.720
a 4.7% point difference or larger in this
test is 9/200 = 0.045 for the first 200 comparisons.
00:07:51.720 --> 00:08:02.070
Is that enough evidence to conclude that the 4.7%
point difference is unlikely to be by chance only?
00:08:02.070 --> 00:08:07.860
Let's take a look at the bigger picture. So
we have the distribution of the estimates
00:08:07.860 --> 00:08:16.170
and we have 10,000 repeated samples. And
sometimes we get a large negative estimate,
00:08:16.170 --> 00:08:21.900
sometimes we get a large positive estimate,
typically we get an estimate where there is
00:08:21.900 --> 00:08:25.830
no difference, because there should not be
any. Because we are taking a random sample
00:08:25.830 --> 00:08:31.320
from the population and comparing it to another sample,
randomization means there
00:08:31.320 --> 00:08:45.000
shouldn't be any systematic differences. The probability of getting a difference of 4.7% points or higher is 0.0347 over the 10,000 replications.
00:08:45.000 --> 00:08:47.568
This probability
is called the p-value.
00:08:47.568 --> 00:08:58.456
It is the probability of observing an effect at least this large if there is actually no effect.
00:08:58.456 --> 00:09:03.090
We also don't have to do the permutation testing; we don't have to do the random sampling.
00:09:03.090 --> 00:09:12.690
That is because this shape here looks familiar: it's the normal distribution. We see that the difference is approximately normally distributed,
00:09:12.690 --> 00:09:18.095
and many things are; many quantities in statistics
follow the normal distribution.
00:09:18.095 --> 00:09:27.068
So instead of approximating this distribution by taking
random samples, we only need to find out what is
00:09:27.068 --> 00:09:32.654
the right normal distribution, that is, where to draw it, and then compare against that normal distribution.
00:09:32.654 --> 00:09:39.390
So here's the normal
distribution, overlaid on the
00:09:39.390 --> 00:09:44.910
observed distribution of estimates. Here
we have the mean of the normal distribution;
00:09:44.910 --> 00:09:53.160
we see here that it's zero, so that's our base case of no difference.
And for the normal distribution
00:09:53.160 --> 00:10:03.546
we also need to know the dispersion, the standard deviation.
And this standard deviation is estimated by the standard error.
00:10:03.546 --> 00:10:07.801
We have the standard error which the statistical software will print out for us.
00:10:07.801 --> 00:10:15.568
We draw a normal distribution
with its mean at 0, which is the null hypothesis value of no difference.
00:10:15.568 --> 00:10:20.220
Then we have this dispersion here,
which is quantified by the standard error.
00:10:20.220 --> 00:10:28.290
Then we ask how probable this is: what is the size of this area here?
How probable is it to get an
00:10:28.290 --> 00:10:43.380
estimate of 4.7% points or higher given the null hypothesis?
It is 0.04, which is less than 0.05, which is
00:10:43.380 --> 00:10:52.073
the usual criterion
for statistical significance.
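The area under the normal curve can also be computed directly. The sketch below, in Python, assumes a standard error of 2.7, which is a made-up value for illustration and not the one printed by the software in the video; the function name normal_upper_tail is likewise an assumption. It returns the probability of an estimate at least as large as 4.7 when the null hypothesis value is 0.

```python
import math


def normal_upper_tail(estimate, null_value, standard_error):
    """Area of the normal curve at or above `estimate`,
    for a normal centred on null_value with sd standard_error."""
    z = (estimate - null_value) / standard_error
    # Upper-tail probability of the standard normal via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))


hypothetical_se = 2.7  # assumed for illustration only, not from the video
p_normal = normal_upper_tail(4.7, 0.0, hypothetical_se)
print(round(p_normal, 3))
```

A larger standard error widens the curve and increases this area, so the same 4.7% point difference would be weaker evidence in a noisier sample.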
00:10:52.073 --> 00:10:58.980
Could it be by chance only? The p-value is less than 0.05.
If this was a research paper we would conclude that there
00:10:58.980 --> 00:11:03.540
is a statistically significant difference and we would write a paper.
We would get it
00:11:03.540 --> 00:11:08.370
hopefully published somewhere because we have a statistically significant result.
00:11:08.370 --> 00:11:14.680
Of course we have to consider that, in this particular scenario, there are probably reporters who want
00:11:14.680 --> 00:11:21.040
to say something positive about women.
So they could do multiple comparisons: they could do
00:11:21.040 --> 00:11:30.280
comparisons of growth, profitability, other important statistics.
And if they happen to
00:11:30.280 --> 00:11:39.280
find one statistic that makes women look better,
then they'd write up a newspaper article about it.
00:11:39.280 --> 00:11:44.440
So p-values work well when you do
just one comparison.
00:11:44.440 --> 00:11:51.760
But because of the nature of the test, we will
eventually get large effects by chance only.
00:11:51.760 --> 00:11:59.710
If we repeat this study for example every year,
we check profitability and we check liquidity,
00:11:59.710 --> 00:12:05.830
we check growth, and we do that over ten years, then we have 30 comparisons. One of those comparisons
00:12:05.830 --> 00:12:16.645
will very likely give us p less than 0.05 by chance only.
So P is less than 0.05 is not very strong evidence.
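To put a rough number on the 30 comparisons, here is a small Python sketch. It assumes the comparisons are independent and that there is no real effect in any of them, which is a simplification: profitability, liquidity and growth figures from the same firms would be correlated. The function name is illustrative.

```python
def chance_of_false_positive(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent comparisons
    gives p < alpha when there is no real effect in any of them."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


chance = chance_of_false_positive(30)
print(round(chance, 3))  # 0.785
```

Under these assumptions the chance of at least one p below 0.05 is roughly 0.79, compared with 0.05 for a single comparison.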
00:12:16.645 --> 00:12:24.430
It is some evidence if it is just
one comparison. But if we do multiple comparisons
00:12:24.430 --> 00:12:31.060
we can do this kind of data mining and always get
something with p less than 0.05.
00:12:31.060 --> 00:12:37.360
If we had p less than 0.001, then I would buy the claim that
there is actually an effect in the population.