WEBVTT 00:00:00.060 --> 00:00:08.940 Now we start talking about statistical inference, which refers to the task of making some kind of 00:00:08.940 --> 00:00:15.540 statistical claims about the population. We don't yet discuss causality or causal 00:00:15.540 --> 00:00:21.120 claims, just claims that there is an association between two variables 00:00:21.120 --> 00:00:26.280 or a difference between two groups in a population, based on sample data. 00:00:26.280 --> 00:00:33.210 And our example now comes from the Talouselämä 500 magazine, which I covered in a previous 00:00:33.210 --> 00:00:40.410 video. This is a Finnish business magazine that follows the 500 largest Finnish companies. 00:00:40.410 --> 00:00:47.820 And in one particular year, 2005, there were big headlines in Finnish 00:00:47.820 --> 00:00:55.890 newspapers because on this list the average return on assets of women-led companies was 00:00:55.890 --> 00:01:05.160 4.7 percentage points higher than in men-led companies. So our question now is: we have an observation, 00:01:05.160 --> 00:01:11.910 the return on assets difference of 4.7 percentage points, and is it a big deal? Does it matter? 00:01:13.140 --> 00:01:19.020 What does the data tell us and what kind of inferences can we make from this sample? 00:01:19.020 --> 00:01:29.970 4.7 percentage points is a pretty big difference. So what? What does it mean? What the data are, and what 00:01:29.970 --> 00:01:40.230 the data tell us directly, is that at one point in time, in one sample, the firms led by women 00:01:40.230 --> 00:01:48.960 are more profitable. That's what the data tell us, and now the question is: can we generalize? 00:01:48.960 --> 00:01:57.300 Can we say something beyond that particular sample? Can we say that this generalizes to 00:01:57.300 --> 00:02:03.660 other years or is it just one year?
If it's just one year, and the women-led companies happen to be 00:02:03.660 --> 00:02:08.100 more profitable but it wouldn't generalize to other years, then it's not a big deal. 00:02:08.100 --> 00:02:16.320 If it generalizes to other years then it's probably a big deal. The second question is: does 00:02:16.320 --> 00:02:22.020 it generalize to other firms? Is it just these 500 companies in which the women-led companies 00:02:22.020 --> 00:02:30.360 are more profitable, or does it generalize to the thousand largest companies, or all companies in 00:02:30.360 --> 00:02:37.890 Finland, or all companies in all countries? How do we generalize, and how widely can we generalize? 00:02:37.890 --> 00:02:44.910 There is a first question that we need to ask when we start discussing the generalizability 00:02:44.910 --> 00:02:48.210 of a sample statistic. This is a sample statistic: 00:02:48.870 --> 00:02:55.020 a number calculated from a sample. Does it generalize to the population? We have to ask: 00:02:55.020 --> 00:03:05.520 could this be by chance only? Is it possible that, because of sampling variation, 00:03:05.520 --> 00:03:12.720 the companies that were led by women just happened to have a better year than companies that were 00:03:12.720 --> 00:03:20.160 led by men? Could it be just a random occurrence, or is it evidence of a systematic difference? 00:03:20.160 --> 00:03:27.960 And we have to ask two important questions to answer that, whether it can be by chance only. 00:03:27.960 --> 00:03:37.080 The first one is: is 4.7 percentage points a large difference? Large differences rarely occur by chance only; 00:03:37.080 --> 00:03:44.550 small differences occur by chance only frequently. When we calculate something from a sample, the 00:03:44.550 --> 00:03:49.890 sample estimate is hardly ever exactly the population value. It's somewhere close.
00:03:49.890 --> 00:03:55.350 So is it far enough from zero to say that it's improbable that this kind 00:03:55.350 --> 00:04:00.840 of result could occur by chance only? Or is it close enough to the population 00:04:00.840 --> 00:04:09.510 value that it actually makes no difference? Then we have to look at whether it is a large effect. 00:04:09.510 --> 00:04:18.210 The mean ROA is about 10% in this sample. And a 4.7 percentage point difference would mean that if the men-led 00:04:18.210 --> 00:04:26.010 companies are at, let's say, 8% ROA, then women-led companies are at about 13% ROA. So they are more than 00:04:26.010 --> 00:04:33.150 50% more profitable than men-led companies. That's a big thing. That's a big difference. 00:04:33.150 --> 00:04:40.470 The second important question relates to sample size. We know that the full sample is 00:04:40.470 --> 00:04:46.200 500 companies but that's not the full story. We also have to consider how many women there are. 00:04:46.200 --> 00:04:54.690 Having just five women or having 250 women: those two conditions would lead 00:04:54.690 --> 00:05:01.320 to very different conclusions. It happens to be that there were 22 women in the sample, so 00:05:01.320 --> 00:05:11.160 that's a fairly small number of observations. Now the question of statistical inference. 00:05:11.160 --> 00:05:20.520 We want to see whether this return on assets difference of 4.7 percentage points is large enough 00:05:20.520 --> 00:05:25.050 that we can conclude that there probably is a difference, a systematic difference, 00:05:25.050 --> 00:05:31.530 and that this is not due to sampling fluctuations only. We have to ask the question: what would 00:05:31.530 --> 00:05:36.150 be the probability of getting this kind of difference by chance only? 00:05:36.150 --> 00:05:42.900 You watched the video about John Rauser. What would John Rauser do in this scenario?
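The point above about the number of women-led companies can be made concrete with a small sketch. It uses the textbook formula for the standard error of a difference between two group means. The spread of ROA values (a standard deviation of 12 percentage points) is a hypothetical figure chosen for illustration, not a number from the lecture, and the function name is likewise made up for this example.

```python
import math


def se_of_mean_difference(sd, n_small, n_total=500):
    """Standard error of (mean of the small group - mean of the rest).

    sd is the standard deviation of ROA across companies, assumed to be
    the same in both groups.
    """
    n_rest = n_total - n_small
    return sd * math.sqrt(1.0 / n_small + 1.0 / n_rest)


# Hypothetical spread of ROA, in percentage points (not from the lecture).
ASSUMED_SD = 12.0

for n_women in (5, 22, 250):
    se = se_of_mean_difference(ASSUMED_SD, n_women)
    print(f"{n_women:>3} women-led companies: typical chance difference "
          f"of about {se:.1f} percentage points")
```

The smaller the group, the larger the differences that arise from sampling variation alone, which is why the same 4.7 percentage points would be read very differently with 5, 22, or 250 women-led companies.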
00:05:42.900 --> 00:05:51.870 We have 500 companies, and we want to know whether the difference between the women-led companies and the 00:05:51.870 --> 00:05:59.970 men-led companies could occur by chance only. One strategy for answering that 00:05:59.970 --> 00:06:08.100 question is to do a permutation analysis, or a permutation test, which is a fairly intuitive way 00:06:08.100 --> 00:06:15.480 of understanding statistical testing. What we do is that we take the list of companies. 00:06:15.480 --> 00:06:21.960 We have the largest companies; I got the data from a database, so this may not be 00:06:21.960 --> 00:06:27.630 the exact same 500 companies, but it doesn't matter for the example. We choose 22 companies 00:06:27.630 --> 00:06:33.840 at random, we compare them against the remaining 478, and we calculate the difference. 00:06:33.840 --> 00:06:39.120 So we have 22 companies again: the mean of 22 companies compared to 00:06:39.120 --> 00:06:45.870 the mean of 478 companies. We repeat this 10,000 times and we see what the difference is each time. 00:06:45.870 --> 00:06:52.680 What is the probability of getting at least a 4.7 percentage point difference in these comparisons? 00:06:52.680 --> 00:06:59.430 So let's take a look at the results. I did the analysis, and here are the first 200 comparisons. 00:06:59.430 --> 00:07:07.740 We can see that quite often, when we randomly take 22 companies and compare them against the 478 remaining 00:07:07.740 --> 00:07:13.500 companies, the difference is very close to zero. So here it is very close to zero, no difference. 00:07:13.500 --> 00:07:17.970 Sometimes we get a negative difference, as here. 00:07:17.970 --> 00:07:23.520 There's no systematic difference, and there cannot be, because I chose the companies randomly, and two 00:07:23.520 --> 00:07:32.250 random samples are always comparable.
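The permutation procedure described above can be sketched in a few lines. This is a minimal illustration, not the analysis used in the lecture: the ROA values are synthetic (drawn from a normal distribution with a hypothetical mean of 10 and standard deviation of 12), and the function name and parameters are made up for this example.

```python
import random


def permutation_p_value(roa, group_size, observed_diff,
                        n_permutations=10_000, seed=0):
    """Share of random splits whose mean difference is at least observed_diff.

    Each split picks group_size companies at random and compares their
    mean ROA against the mean ROA of all remaining companies.
    """
    rng = random.Random(seed)
    total = sum(roa)
    n_rest = len(roa) - group_size
    hits = 0
    for _ in range(n_permutations):
        group_sum = sum(rng.sample(roa, group_size))
        diff = group_sum / group_size - (total - group_sum) / n_rest
        if diff >= observed_diff:
            hits += 1
    return hits / n_permutations


# Synthetic stand-in for the 500 companies (assumption, not the real data).
data_rng = random.Random(42)
roa = [data_rng.gauss(10.0, 12.0) for _ in range(500)]

p = permutation_p_value(roa, group_size=22, observed_diff=4.7)
print(f"share of random splits with a difference of 4.7 or more: {p:.4f}")
```

Because the 22 companies are picked at random, any difference in a single split is due to chance alone. The share of splits that reach 4.7 percentage points therefore estimates how often a difference that large arises by chance.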
But we do get differences larger than 4.7: 00:07:32.250 --> 00:07:40.620 we get them in 9 of the 200 comparisons using this permutation testing strategy. So the probability of getting 00:07:40.620 --> 00:07:51.720 a difference of 4.7 percentage points or larger in this test is 0.045 for the first 200 comparisons. 00:07:51.720 --> 00:08:02.070 Is that enough evidence to conclude that the 4.7 percentage point difference is unlikely to be by chance only? 00:08:02.070 --> 00:08:07.860 Let's take a look at the bigger picture. So we have the distribution of the estimates 00:08:07.860 --> 00:08:16.170 and we have 10,000 repeated samples. Sometimes we get a large negative estimate, 00:08:16.170 --> 00:08:21.900 sometimes we get a large positive estimate, but typically we get an estimate where there is 00:08:21.900 --> 00:08:25.830 no difference, because there should not be any. Because we are taking a random sample 00:08:25.830 --> 00:08:31.320 from the population and comparing it to another sample, because of the randomization there 00:08:31.320 --> 00:08:45.000 shouldn't be any differences. The probability of getting a difference of 4.7 percentage points or higher is 0.0347 over the 10,000 replications. 00:08:45.000 --> 00:08:47.568 This probability is called the p-value. 00:08:47.568 --> 00:08:58.456 It is the probability of observing an effect equally large or larger when there is actually no effect. 00:08:58.456 --> 00:09:03.090 We also don't have to do the permutation testing; we don't have to do the random sampling. 00:09:03.090 --> 00:09:12.690 Because this shape here looks familiar: that's the normal distribution. We see that the difference is approximately normally distributed, 00:09:12.690 --> 00:09:18.095 and many things in statistics follow the normal distribution.
00:09:18.095 --> 00:09:27.068 So instead of approximating this distribution by taking random samples, we only need to find out what 00:09:27.068 --> 00:09:32.654 the right normal distribution is, that is, where to draw the distribution, and then compare against that normal distribution. 00:09:32.654 --> 00:09:39.390 So here's the normal distribution, overlaid on the 00:09:39.390 --> 00:09:44.910 observed distribution of estimates. Here we have the mean of the normal distribution, 00:09:44.910 --> 00:09:53.160 and we see that it is zero, so that's our base case of no difference. For the normal distribution 00:09:53.160 --> 00:10:03.546 we also need to know the dispersion, the standard deviation. And this standard deviation is estimated using the standard error. 00:10:03.546 --> 00:10:07.801 We have the standard error, which the statistical software will print out for us. 00:10:07.801 --> 00:10:15.568 We draw a normal distribution with its mean at 0, which is the null hypothesis value of no difference. 00:10:15.568 --> 00:10:20.220 Then we have this dispersion here, which is quantified by the standard error. 00:10:20.220 --> 00:10:28.290 Then we ask what the size of this area here is. How probable is it to get an 00:10:28.290 --> 00:10:43.380 estimate of 4.7 percentage points or higher given the null hypothesis? It is 0.04, which is less than 0.05, 00:10:43.380 --> 00:10:52.073 the usual criterion for statistical significance. 00:10:52.073 --> 00:10:58.980 Could it be by chance only? P is less than 0.05. If this was a research paper we would conclude that there 00:10:58.980 --> 00:11:03.540 is a statistically significant difference and we would write a paper. We would 00:11:03.540 --> 00:11:08.370 hopefully get it published somewhere because we have a statistically significant result.
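The normal-distribution shortcut described above can be sketched as follows. The lecture does not state the standard error, so the value used here (2.6 percentage points) is a hypothetical placeholder; in practice it would come from the statistical software's output. The function name is also made up for this example.

```python
from statistics import NormalDist


def one_sided_p_value(estimate, standard_error, null_value=0.0):
    """Area of the null distribution at or above the observed estimate.

    The null distribution is a normal curve centered on null_value
    (no difference) whose spread is the standard error.
    """
    null_distribution = NormalDist(mu=null_value, sigma=standard_error)
    return 1.0 - null_distribution.cdf(estimate)


# Hypothetical standard error, in percentage points (not from the lecture).
ASSUMED_SE = 2.6

p = one_sided_p_value(4.7, ASSUMED_SE)
print(f"one-sided p-value under the assumed standard error: {p:.3f}")
```

With this assumed standard error the result falls below 0.05. A larger standard error would widen the curve and push the p-value above that threshold, which is why the dispersion matters as much as the estimate itself.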
00:11:08.370 --> 00:11:14.680 Of course we have to consider that, in this particular scenario, there are probably reporters who want 00:11:14.680 --> 00:11:21.040 to say something positive about women. So they could do multiple comparisons: they could do 00:11:21.040 --> 00:11:30.280 comparisons of growth, profitability, and other important statistics. And if they happen to 00:11:30.280 --> 00:11:39.280 find one statistic that makes women look better, then they'd write up a newspaper article about it. 00:11:39.280 --> 00:11:44.440 The p-values work well when you do just one comparison. 00:11:44.440 --> 00:11:51.760 But because of the nature of the test, we will eventually get large effects by chance only. 00:11:51.760 --> 00:11:59.710 If we repeat this study, for example, every year, and we check profitability, we check liquidity, 00:11:59.710 --> 00:12:05.830 we check growth, and we do that over ten years, then we have 30 comparisons. At least one of those comparisons 00:12:05.830 --> 00:12:16.645 will very likely give us p less than 0.05 by chance only. So p less than 0.05 is not very strong evidence. 00:12:16.645 --> 00:12:24.430 It is some evidence if it is just one comparison. But if we do multiple comparisons, 00:12:24.430 --> 00:12:31.060 we can do this kind of data mining and nearly always get something with p less than 0.05. 00:12:31.060 --> 00:12:37.360 If we had p less than 0.001, then I would buy the claim that there is actually an effect in the population.
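The arithmetic behind the 30-comparison example can be checked with a short sketch. It assumes that the comparisons are independent and that there is no real effect in any of them. Both are simplifying assumptions; yearly figures for the same companies would in reality be correlated. The function name is made up for this example.

```python
def prob_at_least_one_false_positive(alpha, n_comparisons):
    """Chance that at least one of n independent tests gives p < alpha
    when there is no real effect in any of them."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


# 1 comparison, 3 statistics in one year, 3 statistics over ten years.
for k in (1, 3, 30):
    prob = prob_at_least_one_false_positive(0.05, k)
    print(f"{k:>2} comparisons: {prob:.3f}")
```

Under these assumptions, 30 comparisons give roughly a 79% chance of at least one result with p below 0.05 even when nothing is going on, compared with 5% for a single comparison.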