WEBVTT
Kind: captions
Language: en

00:00:00.180 --> 00:00:06.630
When we do statistical analysis we always get 
the point estimate or the estimate of the effect,  

00:00:06.630 --> 00:00:12.060
just one regression coefficient, or one number.
We also need to know how certain we are about  

00:00:12.060 --> 00:00:15.840
that number, and that certainty is 
quantified by the standard error. 

00:00:15.840 --> 00:00:20.730
So that standard error quantifies the 
precision and we use the standard error  

00:00:20.730 --> 00:00:25.980
and the actual estimate to calculate 
test statistics that give us the p-values. 

00:00:25.980 --> 00:00:32.460
In some scenarios calculating the 
standard error is hard, or calculating  

00:00:32.460 --> 00:00:37.200
the standard error is something that requires 
assumptions that we are not willing to make,  

00:00:37.200 --> 00:00:42.600
or assumptions that we know are not 
true for our particular data and analysis. 

00:00:42.600 --> 00:00:47.940
Bootstrapping provides an alternative way of 
calculating standard errors or estimating how much  

00:00:47.940 --> 00:00:54.060
a statistic would vary from one sample to another.
Bootstrapping is a computational approach  

00:00:54.060 --> 00:00:59.760
to the problem of calculating a standard error.
How bootstrapping works is that we have our  

00:00:59.760 --> 00:01:04.590
original sample. So we have a sample 
of 10 observations here from a normally  

00:01:04.590 --> 00:01:08.070
distributed population with mean 
of 0 and standard deviation of 1. 

00:01:08.070 --> 00:01:14.640
So that's our original sample here; the mean is 
0.13 in that sample. And if we take  

00:01:14.640 --> 00:01:20.880
multiple samples from the same population.
Here is the sampling distribution of the  

00:01:20.880 --> 00:01:25.350
sample mean when the sample 
size is 10 from this population. 

00:01:25.350 --> 00:01:30.930
Most of the time we get values close to 
0, which is the population value of the mean. 

00:01:30.930 --> 00:01:36.540
And sometimes 
we get estimates that are far from  

00:01:36.540 --> 00:01:41.340
the actual population value.
The idea of bootstrapping is that if  

00:01:41.340 --> 00:01:49.770
we don't know how to estimate the 
width of this sampling distribution or its  

00:01:49.770 --> 00:01:55.350
shape using statistical theory or a closed-form 
equation, then we can do it empirically. 

00:01:55.350 --> 00:02:02.340
So instead of calculating it using an equation, 
we take repeated samples from our original sample. 

00:02:02.340 --> 00:02:06.000
So that's our original sample; it forms 
the population for the bootstrap. 

00:02:06.000 --> 00:02:15.870
Then we take a repeated sample. We first take 
0.31, which is here, then we put it back, so  

00:02:15.870 --> 00:02:20.760
we allow every observation to 
be included in the sample multiple times. 

00:02:20.760 --> 00:02:27.390
Then we randomly take another 
one, 0.83, which is here, and we put it back,  

00:02:27.390 --> 00:02:31.860
then we take yet another number; 
we take  

00:02:31.860 --> 00:02:37.590
-0.84 for the second time. And so on.
So we take these samples from the  

00:02:37.590 --> 00:02:42.720
original data, and every observation can be 
included in the sample multiple times. 

00:02:42.720 --> 00:02:49.260
Each of these randomly chosen numbers 
doesn't depend on any of the previous choices. 

00:02:49.260 --> 00:02:56.910
Using this bootstrap sample we get 
a sample mean of 0.34. We calculate it many  

00:02:56.910 --> 00:03:01.470
times: typically one hundred, five hundred, 
a thousand, or even ten thousand times,  

00:03:01.470 --> 00:03:08.040
depending on the complexity of the calculation.
A thousand repetitions is quite normal nowadays. 
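The resampling-with-replacement loop described above can be sketched in a few lines of Python. The data here are freshly simulated draws from N(0, 1), not the lecture's actual sample of 10, so the numbers will differ:

```python
import random
import statistics

# Illustrative sample of 10 draws from a N(0, 1) population.
random.seed(1)
sample = [random.gauss(0, 1) for _ in range(10)]

n_reps = 1000  # a thousand replications, as the lecture suggests
boot_means = []
for _ in range(n_reps):
    # Resample with replacement: each observation can appear many times.
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

# The bootstrap standard error is the standard deviation
# of the statistic over the bootstrap replications.
se = statistics.stdev(boot_means)
print(round(se, 3))
```

With a sample of 10, this bootstrap standard error should land near sd(sample)/sqrt(10), the textbook value for the mean.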

00:03:08.040 --> 00:03:14.820
So we can see that from sample to sample the 
sample mean varies. The distribution of this  

00:03:14.820 --> 00:03:21.060
sample mean from the bootstrap samples, calculated  

00:03:21.060 --> 00:03:28.800
from our thousand bootstrap replications here, 
is about the same shape as if we would  

00:03:28.800 --> 00:03:35.550
take samples from the actual population.
So these two distributions are quite similar  

00:03:35.550 --> 00:03:42.930
and we can use that knowledge: these 
two distributions are similar, and they approach  

00:03:42.930 --> 00:03:49.230
each other when the sample size increases. 
We can use that knowledge to say that this  

00:03:49.230 --> 00:03:55.500
distribution here is a good representation of 
that distribution. And if we want to estimate the  

00:03:55.500 --> 00:04:01.470
standard deviation of this distribution, which 
is what standard error quantifies or estimates,  

00:04:01.470 --> 00:04:07.380
then we can just use the standard deviation of 
that distribution. Here we can see that the mean  

00:04:07.380 --> 00:04:11.220
of this distribution is slightly off.
That's called the bootstrap bias. 
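The bootstrap bias mentioned here can be estimated directly: it is the mean of the bootstrap replications minus the statistic computed on the original sample. A minimal sketch with made-up data (for the sample mean this bias is typically tiny; for other statistics it can be substantial):

```python
import random
import statistics

# Illustrative sample, not the lecture's data.
random.seed(2)
sample = [random.gauss(0, 1) for _ in range(10)]
sample_mean = statistics.mean(sample)

boot_means = [
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(1000)
]

# Bias: how far the centre of the bootstrap distribution sits
# from the statistic computed on the original sample.
bias = statistics.mean(boot_means) - sample_mean
print(round(bias, 4))
```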

00:04:11.220 --> 00:04:18.420
So this mean here is roughly at the mean here.
So it's not at the population mean; instead,  

00:04:18.420 --> 00:04:21.360
it's closer to 
the mean of this particular sample. 

00:04:21.360 --> 00:04:27.850
Also, the width of this 
distribution is in this case slightly smaller,  

00:04:27.850 --> 00:04:31.420
so the dispersion here is slightly 
smaller than the dispersion here. 

00:04:31.420 --> 00:04:37.870
And that is also something that we 
sometimes need to take into consideration. 

00:04:37.870 --> 00:04:43.360
The key thing in bootstrapping is that when the 
sample size increases, this mean and  

00:04:43.360 --> 00:04:47.230
this standard deviation will be closer to 
that mean and that standard deviation. 

00:04:47.800 --> 00:04:53.410
Let's take a look at a demonstration 
of how bootstrapping works. This is  

00:04:53.410 --> 00:04:59.080
a video from the Department of Statistics 
at the University of Auckland. 

00:04:59.080 --> 00:05:05.320
They demonstrate that you have your 
original sample here, so we have two variables. 

00:05:05.320 --> 00:05:13.270
We have an X variable and a Y variable, 
and then we have a regression coefficient. 

00:05:13.270 --> 00:05:19.120
So we calculate the regression coefficient 
here and we are interested in how much this  

00:05:19.120 --> 00:05:24.850
regression coefficient, the slope, would 
vary if we were to take samples  

00:05:24.850 --> 00:05:28.750
over and over from the same population.
So that's what the standard error quantifies  

00:05:28.750 --> 00:05:34.300
For some reason we don't want to use the 
normal formula that our statistical software  

00:05:34.300 --> 00:05:37.810
uses to calculate the standard error.
We want to do it by bootstrapping. 

00:05:37.810 --> 00:05:44.620
So we take samples from our original data,  

00:05:44.620 --> 00:05:54.040
like so. You can see here that each 
observation can be included multiple times. 

00:05:54.040 --> 00:05:59.350
Sometimes an observation is not included in the 
sample. Then we get a regression coefficient that  

00:05:59.350 --> 00:06:04.420
is slightly different from the original one.
We do another bootstrap sample,  

00:06:04.420 --> 00:06:08.350
we get another regression coefficient again 
slightly different from the original one. 

00:06:08.350 --> 00:06:19.570
We take yet another bootstrap sample, 
we get a slightly different one, and we go  

00:06:19.570 --> 00:06:25.570
on, a hundred times, a thousand times, and 
ultimately we get an estimate of how much  

00:06:25.570 --> 00:06:30.880
this regression coefficient would really vary 
if we were to take multiple different samples.
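The procedure in the video, resampling whole (x, y) pairs with replacement and refitting the slope each time, can be sketched in plain Python. The data below are invented for illustration (the video's data set is not reproduced here):

```python
import random

def slope(xs, ys):
    # Ordinary least-squares slope: cov(x, y) / var(x).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Illustrative data: y roughly linear in x with noise.
random.seed(3)
x = [random.uniform(0, 10) for _ in range(50)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

boot_slopes = []
for _ in range(1000):
    # Resample whole (x, y) pairs with replacement and refit.
    idx = [random.randrange(len(x)) for _ in range(len(x))]
    boot_slopes.append(slope([x[i] for i in idx], [y[i] for i in idx]))

# The spread of the refitted slopes estimates the
# standard error of the regression coefficient.
mean_s = sum(boot_slopes) / len(boot_slopes)
se = (sum((s - mean_s) ** 2 for s in boot_slopes) / (len(boot_slopes) - 1)) ** 0.5
print(round(se, 3))
```

Resampling pairs (rather than residuals) is the case-resampling bootstrap shown in the animation; it makes no assumption about the error distribution.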

00:06:41.880 --> 00:06:46.380
So that's what you get with a 
thousand samples or a hundred samples. 

00:06:46.380 --> 00:06:53.070
Then you can see how much the 
regression coefficient varies between  

00:06:53.070 --> 00:06:58.500
the bootstrap samples. And if the sample size is 
large enough, this variation between the bootstrap  

00:06:58.500 --> 00:07:04.650
samples is a good approximation of how much the 
regression coefficient would vary, if we were to  

00:07:04.650 --> 00:07:10.290
take repeated independent samples from the same 
population and calculate the regression analysis  

00:07:10.290 --> 00:07:16.260
again and again from those independent samples.
Bootstrapping can be used to calculate the  

00:07:16.260 --> 00:07:21.450
standard error, in which case we just take the 
standard deviation of these regression slopes, and  

00:07:21.450 --> 00:07:25.770
then that is our standard error estimate.
We can also use bootstrapping to  

00:07:25.770 --> 00:07:29.700
calculate confidence intervals.
So the idea of a confidence interval is  

00:07:29.700 --> 00:07:36.210
that instead of estimating a standard error 
and a p-value, we estimate a point estimate. 

00:07:36.210 --> 00:07:43.740
So, for example, the value of a correlation, one single 
value, and then we estimate an interval, let's  

00:07:43.740 --> 00:07:50.640
say a 95% interval, which has an upper limit and 
a lower limit. And then if we repeat the calculation  

00:07:50.640 --> 00:07:57.120
many many times from independent samples. Then 
the population value will be within the interval  

00:07:57.120 --> 00:08:03.630
if it's a valid interval, 95% of the time.
So this is an example of correlation, and we  

00:08:03.630 --> 00:08:08.430
can see that the correlation estimates, when 
there is a zero correlation in the population  

00:08:08.430 --> 00:08:13.110
and we have a small sample size, vary between 
-0.2 and +0.2. 

00:08:13.110 --> 00:08:20.340
And most of the time when we draw the confidence 
interval, which is the line here, the  

00:08:20.340 --> 00:08:24.660
line includes the population value.
In two and a half percent of the  

00:08:24.660 --> 00:08:31.020
replications here, it doesn't include 
the population value. So the population  

00:08:31.020 --> 00:08:37.650
value here falls above the upper limit.
Here we have extremely large correlations, and the  

00:08:37.650 --> 00:08:44.970
population value for about two and a half percent 
of the replications falls below the lower limit. 

00:08:44.970 --> 00:08:50.910
In 95% of the cases the 
population value is within the interval. 

00:08:50.910 --> 00:08:56.820
So that's the idea of confidence intervals.
Here we can see that when the population value is  

00:08:56.820 --> 00:09:04.350
large, the width of the confidence interval 
depends on the correlation estimate. So when the  

00:09:04.350 --> 00:09:09.720
correlation estimate is very high, the 
confidence interval is narrow. When  

00:09:09.720 --> 00:09:15.570
the correlation estimate is very low, the 
confidence interval is a lot wider. 

00:09:15.570 --> 00:09:20.400
So the confidence interval depends on the 
value of the statistic and also it depends  

00:09:20.400 --> 00:09:27.630
on the estimated standard error of the statistic.
Now there are a couple of ways that bootstrapping  

00:09:27.630 --> 00:09:31.020
can be used for calculating 
a confidence interval. 

00:09:31.020 --> 00:09:34.440
Normally when we do confidence intervals 
we use the normal approximation. 

00:09:34.440 --> 00:09:38.430
So the idea is that we 
assume that the estimate is  

00:09:38.430 --> 00:09:44.070
normally distributed over repeated samples.
Then we calculate the confidence interval,  

00:09:44.070 --> 00:09:51.690
as the estimate plus or minus 1.96, which 
covers 95% of the normal distribution,  

00:09:51.690 --> 00:09:57.750
multiplied by the standard error.
So that gives us the plus or minus. 

00:09:57.750 --> 00:10:04.320
So if we have an estimate of correlation 
that is here, then we multiply the standard  

00:10:04.320 --> 00:10:11.910
error by 1.96. The estimate minus 
that is the lower limit; the estimate plus  

00:10:11.910 --> 00:10:17.580
1.96 times the standard error is the upper limit.
So that gives us the upper and lower limits; in this  

00:10:17.580 --> 00:10:23.400
example, 1 percent and 13 percent, when 
the actual estimate is about 5 percent. 

00:10:23.400 --> 00:10:30.150
The way we use bootstrapping 
for this calculation is that  

00:10:30.150 --> 00:10:33.630
the standard error is simply the standard 
deviation of the bootstrap estimates. 

00:10:33.630 --> 00:10:39.090
So if we take a correlation and bootstrap 
it, then we calculate how much the  

00:10:39.090 --> 00:10:44.520
correlation varies between the bootstrap 
samples using the standard deviation,  

00:10:44.520 --> 00:10:50.820
and then we plug that in.
That formula gives us the confidence interval. 
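The normal-approximation interval just described can be sketched directly. The statistic here is a sample mean with made-up data, standing in for the correlation example:

```python
import random
import statistics

# Illustrative sample; any statistic could be used in place of the mean.
random.seed(4)
sample = [random.gauss(0, 1) for _ in range(30)]
estimate = statistics.mean(sample)

boot = [
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(1000)
]
se = statistics.stdev(boot)  # bootstrap standard error

# estimate ± 1.96 × SE covers 95% under the normality assumption.
lower = estimate - 1.96 * se
upper = estimate + 1.96 * se
print(round(lower, 3), round(upper, 3))
```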

00:10:50.820 --> 00:10:56.940
So that works when we can assume that 
the estimate is normally distributed. 

00:10:56.940 --> 00:11:00.480
What if we can't assume 
that the estimate is normally distributed? 

00:11:00.480 --> 00:11:05.940
That is the case where we can use empirical 
confidence intervals based on bootstrapping. 

00:11:05.940 --> 00:11:12.780
So the idea of the normal approximation interval 
is that the estimate is normally distributed. 

00:11:12.780 --> 00:11:17.820
If it is, we can use this equation; otherwise we 
can use empirical confidence intervals. 

00:11:17.820 --> 00:11:24.570
The idea of an empirical confidence interval is 
that we do the bootstrapping, and then let's  

00:11:24.570 --> 00:11:31.560
say we take a thousand bootstrap replications.
Then, ordering them from smallest to largest,  

00:11:31.560 --> 00:11:37.350
we take the 25th value of the 
bootstrap replicates and that is  

00:11:37.350 --> 00:11:45.180
our lower limit for the confidence interval.
Then we take the 975th, and that is the  

00:11:45.180 --> 00:11:49.980
upper limit. So that's the 2.5 percent point 
and the 97.5 percent point, and that's the  

00:11:49.980 --> 00:11:54.570
upper limit of our confidence interval.
Those are called percentile intervals. 

00:11:54.570 --> 00:12:01.170
So when we have this kind of bootstrap 
distribution, we take the replication  

00:12:01.170 --> 00:12:07.050
here, the 25th replication, as our 
lower limit, and we take the 975th  

00:12:07.050 --> 00:12:12.060
replication here, that is our upper limit.
So that gives us the confidence interval  

00:12:12.060 --> 00:12:17.490
for the mean that is estimated here.
This approach has two problems. 

00:12:17.490 --> 00:12:25.410
First, the bootstrap distribution is biased.
So the mean of these bootstrap replications  

00:12:25.410 --> 00:12:31.260
is about 0.15 and the actual 
sample value for the mean is zero. 

00:12:31.260 --> 00:12:38.490
To account for that bias we have 
bias-corrected confidence intervals. 

00:12:38.490 --> 00:12:46.470
The idea of bias-corrected confidence intervals 
is that instead of taking the 25th and 975th  

00:12:46.470 --> 00:12:52.470
bootstrap replicates as the endpoints, we 
first estimate how much the bootstrap bias  

00:12:52.470 --> 00:13:00.180
is, and then, based on that estimate, we take 
for example the 40th and 988th replications. 
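The shift from the fixed 25th/975th replicates to adjusted ones can be sketched with the standard bias-corrected (BC) rule: measure how far the bootstrap distribution sits from the original estimate, then move the percentile cut-points accordingly. This sketch implements BC only, without the acceleration term covered next; the data are illustrative:

```python
import random
from statistics import NormalDist, mean

nd = NormalDist()
random.seed(6)
# A skewed sample, where bias correction actually matters.
sample = [random.expovariate(1.0) for _ in range(30)]
estimate = mean(sample)

boot = sorted(
    mean(random.choices(sample, k=len(sample))) for _ in range(1000)
)

# z0 measures the median bias: how many replicates fall below
# the original estimate, mapped through the inverse normal CDF.
prop_below = sum(b < estimate for b in boot) / len(boot)
z0 = nd.inv_cdf(prop_below)

# Adjusted percentiles replace the plain 2.5% and 97.5% points.
lo_p = nd.cdf(2 * z0 - 1.96)
hi_p = nd.cdf(2 * z0 + 1.96)
lower = boot[int(lo_p * (len(boot) - 1))]
upper = boot[int(hi_p * (len(boot) - 1))]
print(round(lower, 3), round(upper, 3))
```

When z0 is zero (no bias), lo_p and hi_p reduce to 0.025 and 0.975 and we get the plain percentile interval back.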

00:13:00.180 --> 00:13:06.660
So instead of taking the fixed 25th and the 
fixed 975th, we adjust which  

00:13:06.660 --> 00:13:13.980
replicates we take as the endpoints.
There's also the problem that the variance, the  

00:13:13.980 --> 00:13:19.020
standard deviation, here is not always the 
same as the standard deviation here. 

00:13:19.020 --> 00:13:26.760
So in the correlation example you saw 
that the confidence interval narrowed as the  

00:13:26.760 --> 00:13:34.080
actual correlation estimate went up.
So the idea is that the width of the  

00:13:34.080 --> 00:13:40.860
interval depends on the value of the estimate.
To take that into account we have bias- 

00:13:40.860 --> 00:13:47.220
corrected and accelerated confidence 
intervals, which apply the same idea as the  

00:13:47.220 --> 00:13:53.280
bias-corrected ones, but instead of just taking 
the bias into account, they also take the  

00:13:53.280 --> 00:13:59.400
estimated differences in variance of these two 
distributions into account when we choose the  

00:13:59.400 --> 00:14:07.320
endpoints for the confidence intervals.
Now, the question is: this looks really  

00:14:07.320 --> 00:14:12.090
good, so can we estimate the 
variance of any statistic  

00:14:12.090 --> 00:14:17.640
empirically without having to know the math?
And yes, that's basically true, with  

00:14:17.640 --> 00:14:24.960
some qualifications. The main qualification is that 
bootstrapping requires a large sample size. 

00:14:24.960 --> 00:14:31.650
There is a good book chapter by 
Koopman and co-authors in the book edited by  

00:14:31.650 --> 00:14:37.560
Lance and Vandenberg about statistical myths and urban 
legends, and they point out that there are  

00:14:37.560 --> 00:14:43.620
three different claims made in the literature.
There's the claim that bootstrapping works  

00:14:43.620 --> 00:14:49.620
well in small samples, but there is the 
fact that bootstrapping assumes that the  

00:14:49.620 --> 00:14:54.810
sample is representative of the population.
So if our sample is very different from the  

00:14:54.810 --> 00:14:59.940
population then the bootstrap samples 
that we take from our original sample  

00:14:59.940 --> 00:15:04.980
cannot approximate how samples would 
actually behave from the real population. 

00:15:04.980 --> 00:15:10.740
Then sampling error, which means how 
different the sample is from the population,  

00:15:10.740 --> 00:15:15.990
is troublesome in small samples.
So in small samples the sample may not be a  

00:15:15.990 --> 00:15:23.400
very accurate representation of the population.
So if small samples are not representative of the  

00:15:23.400 --> 00:15:31.290
population, and if we require that the sample must 
be representative of the population, then bootstrapping  

00:15:31.290 --> 00:15:34.200
cannot work in small samples.
So bootstrapping generally  

00:15:34.200 --> 00:15:37.690
requires a large sample size.
Then there are also some boundary conditions  

00:15:37.690 --> 00:15:43.150
under which bootstrapping doesn't work even if 
you have a large sample. So there are such  

00:15:43.150 --> 00:15:49.570
scenarios, but for most practical applications 
the sample size is the only thing you need to  

00:15:49.570 --> 00:15:55.480
be concerned about. The problem is that it is very 
hard to say when your sample size is large enough.