WEBVTT Kind: captions Language: en 00:00:00.180 --> 00:00:06.630 When we do statistical analysis we always get a point estimate, the estimate of the effect, 00:00:06.630 --> 00:00:12.060 just one regression coefficient, or one number. We also need to know how certain we are about 00:00:12.060 --> 00:00:15.840 that number, and that certainty is quantified by the standard error. 00:00:15.840 --> 00:00:20.730 So the standard error quantifies the precision, and we use the standard error 00:00:20.730 --> 00:00:25.980 and the actual estimate to calculate test statistics that give us the p-values. 00:00:25.980 --> 00:00:32.460 In some scenarios calculating the standard error is hard, or calculating 00:00:32.460 --> 00:00:37.200 the standard error is something that requires assumptions that we are not willing to make, 00:00:37.200 --> 00:00:42.600 or assumptions that we know are not true for our particular data and analysis. 00:00:42.600 --> 00:00:47.940 Bootstrapping provides an alternative way of calculating standard errors, or estimating how much 00:00:47.940 --> 00:00:54.060 a statistic would vary from one sample to another. Bootstrapping is a computational approach 00:00:54.060 --> 00:00:59.760 to the problem of calculating a standard error. How bootstrapping works is that we have our 00:00:59.760 --> 00:01:04.590 original sample. So we have a sample of 10 observations here from a normally 00:01:04.590 --> 00:01:08.070 distributed population with a mean of 0 and a standard deviation of 1. 00:01:08.070 --> 00:01:14.640 That's our original sample here; its mean is 0.13. If we take 00:01:14.640 --> 00:01:20.880 multiple samples from the same population, here is the sampling distribution of 00:01:20.880 --> 00:01:25.350 the sample mean when the sample size is 10 from this population. 00:01:25.350 --> 00:01:30.930 Most of the time we get values close to 0, which is the population mean.
00:01:30.930 --> 00:01:36.540 And then sometimes we get estimates that are far 00:01:36.540 --> 00:01:41.340 from the actual population value. The idea of bootstrapping is that if 00:01:41.340 --> 00:01:49.770 we don't know how to estimate the width of this sampling distribution, or its 00:01:49.770 --> 00:01:55.350 shape, using statistical theory or a closed-form equation, then we can do it empirically. 00:01:55.350 --> 00:02:02.340 So instead of calculating it using an equation, we take repeated samples from our original sample. 00:02:02.340 --> 00:02:06.000 So our original sample forms the population for the bootstrap. 00:02:06.000 --> 00:02:15.870 Then we take a repeated sample: we first take 0.31, which is here, then we put it back, so 00:02:15.870 --> 00:02:20.760 we allow every observation to be included in the sample multiple times. 00:02:20.760 --> 00:02:27.390 Then we randomly take another one, 0.83, which is here, and we put it back; 00:02:27.390 --> 00:02:31.860 then we take yet another number, and yet another; we take 00:02:31.860 --> 00:02:37.590 -0.84 a second time. And so on. So we take these samples from the 00:02:37.590 --> 00:02:42.720 original data, and every observation can be included in the sample multiple times. 00:02:42.720 --> 00:02:49.260 Each of these randomly chosen numbers doesn't depend on any of the previous choices. 00:02:49.260 --> 00:02:56.910 Using this bootstrap sample we get 0.34 as the sample mean. We calculate it many 00:02:56.910 --> 00:03:01.470 times: typically one hundred, five hundred, a thousand, or even ten thousand times, 00:03:01.470 --> 00:03:08.040 depending on the complexity of the calculation. A thousand repetitions is quite normal nowadays.
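The resampling-with-replacement procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the data values are hypothetical stand-ins for the sample on the slide (only 0.31, 0.83, and -0.84 are actually mentioned).

```python
import random
import statistics

random.seed(42)  # for a reproducible sketch

# Hypothetical stand-in for the original sample of 10 observations
# drawn from a standard normal population.
sample = [0.13, 0.31, 0.83, -0.84, 1.20, -0.50, 0.05, -1.10, 0.40, 0.65]

def bootstrap_means(data, reps=1000):
    """Draw `reps` bootstrap samples (same size as the data, with
    replacement, so each observation can appear several times) and
    return the mean of each."""
    n = len(data)
    return [statistics.mean(random.choices(data, k=n)) for _ in range(reps)]

boot_means = bootstrap_means(sample)

# The bootstrap standard error of the mean is the standard deviation
# of the bootstrap replicates of the statistic.
boot_se = statistics.stdev(boot_means)
```

The key step is `random.choices(data, k=n)`: sampling with replacement, so each bootstrap sample has the same size as the original and observations can repeat or drop out.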
00:03:08.040 --> 00:03:14.820 So we can see that from sample to sample the sample mean varies. The distribution of 00:03:14.820 --> 00:03:21.060 the sample mean from the bootstrap samples, calculated 00:03:21.060 --> 00:03:28.800 from our thousand bootstrap replications here, is about the same shape as if we 00:03:28.800 --> 00:03:35.550 took the samples from the actual population. So these two distributions are quite similar, 00:03:35.550 --> 00:03:42.930 and we can use that information, the knowledge that these two distributions are similar and that they approach 00:03:42.930 --> 00:03:49.230 each other when the sample size increases, to say that this 00:03:49.230 --> 00:03:55.500 distribution here is a good representation of that distribution. And if we want to estimate the 00:03:55.500 --> 00:04:01.470 standard deviation of this distribution, which is what the standard error quantifies or estimates, 00:04:01.470 --> 00:04:07.380 then we can just use the standard deviation of that distribution. Here we can see that the mean 00:04:07.380 --> 00:04:11.220 of this distribution is slightly off. That's called the bootstrap bias. 00:04:11.220 --> 00:04:18.420 So this mean here is roughly at the mean here. It's not at the population mean; instead 00:04:18.420 --> 00:04:21.360 it's closer to the mean of this particular sample. 00:04:21.360 --> 00:04:27.850 Then also the width of this distribution is in this case slightly smaller, 00:04:27.850 --> 00:04:31.420 so the dispersion here is slightly smaller than the dispersion here. 00:04:31.420 --> 00:04:37.870 And that is also something that we sometimes need to take into consideration.
00:04:37.870 --> 00:04:43.360 The key thing in bootstrapping is that when the sample size increases, this mean and 00:04:43.360 --> 00:04:47.230 this standard deviation will be closer to that mean and that standard deviation. 00:04:47.800 --> 00:04:53.410 Let's take a look at a demonstration of how bootstrapping works. This is 00:04:53.410 --> 00:04:59.080 a video from the Department of Statistics at the University of Auckland. 00:04:59.080 --> 00:05:05.320 They demonstrate that you have your original sample here, so we have two variables. 00:05:05.320 --> 00:05:13.270 We have an X variable and a Y variable, and then we have a regression coefficient. 00:05:13.270 --> 00:05:19.120 So we calculate the regression coefficient here, and we are interested in how much this 00:05:19.120 --> 00:05:24.850 regression coefficient, the slope, would vary if we were to take this sample 00:05:24.850 --> 00:05:28.750 over and over from the same population. That's what the standard error quantifies. 00:05:28.750 --> 00:05:34.300 For some reason we don't want to use the normal formula that our statistical software 00:05:34.300 --> 00:05:37.810 uses to calculate the standard error. We want to do it by bootstrapping. 00:05:37.810 --> 00:05:44.620 So we take some samples from our original data. 00:05:44.620 --> 00:05:54.040 You can see here that each observation can be included multiple times. 00:05:54.040 --> 00:05:59.350 Sometimes an observation is not included in the sample. Then we get a regression coefficient that 00:05:59.350 --> 00:06:04.420 is slightly different from the original one. We do another bootstrap sample, 00:06:04.420 --> 00:06:08.350 we get another regression coefficient, again slightly different from the original one.
00:06:08.350 --> 00:06:19.570 We take yet another bootstrap sample, we get a slightly different one, and we go 00:06:19.570 --> 00:06:25.570 on a hundred times, a thousand times, and ultimately we get an estimate of how much 00:06:25.570 --> 00:06:30.880 this regression coefficient would really vary if we were to take multiple different samples. 00:06:41.880 --> 00:06:46.380 So that's what you get with a thousand samples or a hundred samples. 00:06:46.380 --> 00:06:53.070 Then you can see that the variance of the regression coefficient is that much between 00:06:53.070 --> 00:06:58.500 the bootstrap samples, and if the sample size is large enough this variation over the bootstrap 00:06:58.500 --> 00:07:04.650 samples is a good approximation of how much the regression coefficient would vary if we were to 00:07:04.650 --> 00:07:10.290 take repeated independent samples from the same population and calculate the regression analysis 00:07:10.290 --> 00:07:16.260 again and again from those independent samples. Bootstrapping can be used to calculate the 00:07:16.260 --> 00:07:21.450 standard error, in which case we just take the standard deviation of these regression slopes and 00:07:21.450 --> 00:07:25.770 then that is our standard error estimate. We can also use bootstrapping to 00:07:25.770 --> 00:07:29.700 calculate confidence intervals. The idea of a confidence interval is 00:07:29.700 --> 00:07:36.210 that instead of estimating a standard error and a p-value for a point estimate, 00:07:36.210 --> 00:07:43.740 for example a single value of a correlation, we estimate an interval, let's 00:07:43.740 --> 00:07:50.640 say a 95% interval, which has an upper limit and a lower limit. Then if we repeat the calculation 00:07:50.640 --> 00:07:57.120 many times from independent samples, the population value will be within the interval, 00:07:57.120 --> 00:08:03.630 if it's a valid interval, 95% of the time.
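The case-resampling bootstrap for a regression slope, as shown in the demonstration video, can be sketched like this. This is an illustrative stand-in, not the demonstration's own code, and the data are simulated with an assumed true slope of about 2.

```python
import random
import statistics

random.seed(0)  # reproducible sketch

def slope(pairs):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    mx = statistics.mean(x for x, _ in pairs)
    my = statistics.mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# Hypothetical data: y = 2x + noise, so the true slope is about 2.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]

# Case resampling: each bootstrap sample draws whole (x, y) pairs with
# replacement, then the regression is refit on that sample.
boot_slopes = [slope(random.choices(data, k=len(data)))
               for _ in range(1000)]

# Bootstrap estimate of the slope's standard error.
se_slope = statistics.stdev(boot_slopes)
```

Note that whole observations (x, y pairs) are resampled together, which matches the video: a point can appear several times in one bootstrap sample or be left out entirely.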
So this is an example of correlation. We 00:08:03.630 --> 00:08:08.430 can see that when there is zero correlation in the population 00:08:08.430 --> 00:08:13.110 and we have a small sample size, the correlation estimates vary between -0.2 and +0.2. 00:08:13.110 --> 00:08:20.340 Most of the time, when we draw the confidence interval, which is the line here, the 00:08:20.340 --> 00:08:24.660 line includes the population value. For two and a half percent of the 00:08:24.660 --> 00:08:31.020 replications, here, it doesn't include the population value. So the population 00:08:31.020 --> 00:08:37.650 value here falls above the upper limit. Here we have extremely large correlations, and the 00:08:37.650 --> 00:08:44.970 population value for about two and a half percent of the replications falls below the lower limit. 00:08:44.970 --> 00:08:50.910 In 95% of the cases here the population value is within the interval. 00:08:50.910 --> 00:08:56.820 So that's the idea of confidence intervals. Here we can see that when the population value is 00:08:56.820 --> 00:09:04.350 large, the width of the confidence interval depends on the correlation estimate. So when the 00:09:04.350 --> 00:09:09.720 correlation estimate is very high, the confidence interval is narrow. When 00:09:09.720 --> 00:09:15.570 the correlation estimate is very low, the confidence interval is a lot wider. 00:09:15.570 --> 00:09:20.400 So the confidence interval depends on the value of the statistic, and it also depends 00:09:20.400 --> 00:09:27.630 on the estimated standard error of the statistic. Now, there are a couple of ways that bootstrapping 00:09:27.630 --> 00:09:31.020 can be used for calculating confidence intervals. 00:09:31.020 --> 00:09:34.440 Normally when we do confidence intervals we use the normal approximation.
00:09:34.440 --> 00:09:38.430 So the idea is that we assume that the estimate is 00:09:38.430 --> 00:09:44.070 normally distributed over repeated samples. Then we calculate the confidence interval 00:09:44.070 --> 00:09:51.690 as our estimate plus or minus 1.96, which covers 95% of the normal distribution, 00:09:51.690 --> 00:09:57.750 multiplied by the standard error. So that gives us the plus or minus. 00:09:57.750 --> 00:10:04.320 So if we have an estimate of correlation that is here, then we multiply the standard 00:10:04.320 --> 00:10:11.910 error by 1.96; the estimate minus that is the lower limit, and the estimate plus 00:10:11.910 --> 00:10:17.580 1.96 times the standard error is here. So that gives us the upper and lower limits, in this 00:10:17.580 --> 00:10:23.400 example about 1 percent and 13 percent, when the actual estimate is about 5 percent. 00:10:23.400 --> 00:10:30.150 How we use bootstrapping for this calculation is that 00:10:30.150 --> 00:10:33.630 the standard error is simply the standard deviation of the bootstrap estimates. 00:10:33.630 --> 00:10:39.090 So if we take a correlation and bootstrap it, then we calculate how much the 00:10:39.090 --> 00:10:44.520 correlation varies between the bootstrap samples using the standard deviation metric, 00:10:44.520 --> 00:10:50.820 and then we plug that in. That formula gives us the confidence intervals. 00:10:50.820 --> 00:10:56.940 So that works when we can assume that the estimate is normally distributed. 00:10:56.940 --> 00:11:00.480 What if we can't assume that the estimate is normally distributed? 00:11:00.480 --> 00:11:05.940 That is the case when we can use empirical confidence intervals based on bootstrapping. 00:11:05.940 --> 00:11:12.780 So the idea of the normal approximation interval is that the estimate is normally distributed; 00:11:12.780 --> 00:11:17.820 then we can use this equation. Or we can use empirical confidence intervals.
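The normal-approximation interval just described (estimate ± 1.96 × bootstrap SE) can be sketched as follows; the sample here is simulated for illustration, and the mean stands in for whatever statistic is being bootstrapped.

```python
import random
import statistics

random.seed(1)  # reproducible sketch

# Hypothetical sample; the mean stands in for any statistic.
sample = [random.gauss(0, 1) for _ in range(50)]
estimate = statistics.mean(sample)

# Bootstrap standard error: the standard deviation of the replicates.
reps = [statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(1000)]
se = statistics.stdev(reps)

# Normal-approximation 95% interval: estimate +/- 1.96 * SE,
# where 1.96 covers 95% of the standard normal distribution.
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
```

This only replaces the standard error in the usual formula with its bootstrap estimate; the normality assumption about the estimate itself remains.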
00:11:17.820 --> 00:11:24.570 The idea of an empirical confidence interval is that we do the bootstrapping and, let's 00:11:24.570 --> 00:11:31.560 say, we take a thousand bootstrap replications. Then, ordering them from smallest to largest, 00:11:31.560 --> 00:11:37.350 we take the 25th value of the bootstrap replicates and that is 00:11:37.350 --> 00:11:45.180 our lower limit for the confidence interval. Then we take the 975th and that is the 00:11:45.180 --> 00:11:49.980 upper limit. So that's the 2.5 percent point, and the 97.5 percent point is the 00:11:49.980 --> 00:11:54.570 upper limit of our confidence interval. These are called percentile intervals. 00:11:54.570 --> 00:12:01.170 So when we have this kind of bootstrap distribution, we take the 00:12:01.170 --> 00:12:07.050 25th replication here, that is our lower limit, and we take the 975th 00:12:07.050 --> 00:12:12.060 replication here, that is our upper limit. So that gives us the confidence interval 00:12:12.060 --> 00:12:17.490 for the mean that is estimated here. This approach has two problems. 00:12:17.490 --> 00:12:25.410 First, the bootstrap distribution is biased. So the mean of these bootstrap replications 00:12:25.410 --> 00:12:31.260 is about 0.15, and the actual sample value for the mean is zero. 00:12:31.260 --> 00:12:38.490 To account for that bias we have bias-corrected confidence intervals. 00:12:38.490 --> 00:12:46.470 The idea of bias-corrected confidence intervals is that instead of taking the 25th and 975th 00:12:46.470 --> 00:12:52.470 bootstrap replicates as the endpoints, we first estimate how much the bootstrap bias 00:12:52.470 --> 00:13:00.180 is, and then based on that estimate we take, for example, the 40th and 988th replications.
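The percentile interval described above amounts to sorting the replicates and reading off two of them. A minimal sketch, again with simulated data and the mean as the statistic:

```python
import random
import statistics

random.seed(2)  # reproducible sketch

sample = [random.gauss(0, 1) for _ in range(50)]

# 1000 bootstrap replicates of the mean, sorted from smallest to largest.
reps = sorted(statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(1000))

# Percentile interval: counting from 1 as in the lecture, the 25th
# value is the 2.5% limit and the 975th value is the 97.5% limit
# (indices 24 and 974 with Python's 0-based indexing).
lower, upper = reps[24], reps[974]
```

No normality assumption is needed here; the interval's shape comes directly from the empirical bootstrap distribution.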
00:13:00.180 --> 00:13:06.660 So instead of taking the fixed 25th and the fixed 975th, we adjust which 00:13:06.660 --> 00:13:13.980 replicates we take as the endpoints. There's also the problem that the variance, the 00:13:13.980 --> 00:13:19.020 standard deviation, here is not necessarily the same as the standard deviation here. 00:13:19.020 --> 00:13:26.760 So in the correlation example you saw that the confidence interval narrowed as the 00:13:26.760 --> 00:13:34.080 actual correlation estimate went up. So the idea is that the width of the 00:13:34.080 --> 00:13:40.860 interval depends on the value of the estimate. To take that into account we have 00:13:40.860 --> 00:13:47.220 bias-corrected and accelerated confidence intervals, which apply the same idea as the 00:13:47.220 --> 00:13:53.280 bias-corrected ones, but instead of just taking the bias into account, they take the 00:13:53.280 --> 00:13:59.400 estimated differences in variance of these two distributions into account when we choose the 00:13:59.400 --> 00:14:07.320 endpoints for the confidence intervals. Now, the question is, this all looks really 00:14:07.320 --> 00:14:12.090 good, so can we estimate the variance of any statistic 00:14:12.090 --> 00:14:17.640 empirically without having to know the math? And yes, that's basically true, with 00:14:17.640 --> 00:14:24.960 some qualifications. The qualifications are that bootstrapping requires a large sample size. 00:14:24.960 --> 00:14:31.650 There is a good article, or a book chapter, by Koopman and co-authors in the book edited by 00:14:31.650 --> 00:14:37.560 Vandenberg about statistical myths and urban legends, and they point out that there are 00:14:37.560 --> 00:14:43.620 three different claims made in the literature.
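The endpoint-shifting idea behind bias-corrected and accelerated (BCa) intervals can be sketched with the standard library. This is a simplified illustration of the standard BCa construction (bias correction from the bootstrap distribution, acceleration from the jackknife), not code from the lecture or any particular package, and the data are simulated:

```python
import random
import statistics
from statistics import NormalDist

random.seed(3)  # reproducible sketch
nd = NormalDist()  # standard normal, for Phi and its inverse

sample = [random.gauss(0, 1) for _ in range(30)]
theta_hat = statistics.mean(sample)

B = 2000
reps = sorted(statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(B))

# Bias correction z0: how far the original estimate sits from the
# median of the bootstrap distribution.
z0 = nd.inv_cdf(sum(r < theta_hat for r in reps) / B)

# Acceleration a, from jackknife (leave-one-out) estimates: captures
# how the statistic's variability changes with its value.
jack = [statistics.mean(sample[:i] + sample[i + 1:])
        for i in range(len(sample))]
jm = statistics.mean(jack)
a = (sum((jm - j) ** 3 for j in jack)
     / (6 * sum((jm - j) ** 2 for j in jack) ** 1.5))

def bca_endpoint(alpha):
    """Shift the nominal percentile by z0 and a, then pick that replicate."""
    z = nd.inv_cdf(alpha)
    adj = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    return reps[min(B - 1, max(0, round(adj * B) - 1))]

# Instead of the fixed 2.5th/97.5th percentile replicates,
# BCa moves the endpoints according to z0 and a.
lower, upper = bca_endpoint(0.025), bca_endpoint(0.975)
```

With z0 = 0 and a = 0 this reduces exactly to the percentile interval; nonzero values shift which replicates serve as the endpoints, which is the adjustment described in the lecture.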
There's the claim that bootstrapping works 00:14:43.620 --> 00:14:49.620 well in small samples, and there is the fact that bootstrapping assumes that the 00:14:49.620 --> 00:14:54.810 sample is representative of the population. So if our sample is very different from the 00:14:54.810 --> 00:14:59.940 population, then the bootstrap samples that we take from our original sample 00:14:59.940 --> 00:15:04.980 cannot approximate how samples would actually behave from the real population. 00:15:04.980 --> 00:15:10.740 Then, sampling error, which means how different the sample is from the population, 00:15:10.740 --> 00:15:15.990 is troublesome in small samples. So in small samples the sample may not be 00:15:15.990 --> 00:15:23.400 a very accurate representation of the population. So if small samples are not representative of the 00:15:23.400 --> 00:15:31.290 population, and if we require that the sample must be representative of the population, then bootstrapping 00:15:31.290 --> 00:15:34.200 cannot work in small samples. So bootstrapping generally 00:15:34.200 --> 00:15:37.690 requires a large sample size. Then there are also some boundary conditions 00:15:37.690 --> 00:15:43.150 under which bootstrapping doesn't work even if you have a large sample. So there are those kinds of 00:15:43.150 --> 00:15:49.570 scenarios, but for most practical applications the sample size is the only thing that you need to 00:15:49.570 --> 00:15:55.480 be concerned about. The problem is that it is very hard to say when your sample size is large enough.