WEBVTT Kind: captions Language: en 00:00:00.180 --> 00:00:06.630 When we do statistical analysis we always get a point estimate, the estimate of the effect, 00:00:06.630 --> 00:00:12.060 just one regression coefficient, or one number. We also need to know how certain we are about 00:00:12.060 --> 00:00:15.840 that number, and that certainty is quantified by the standard error. 00:00:15.840 --> 00:00:20.730 So the standard error quantifies the precision, and we use the standard error 00:00:20.730 --> 00:00:25.980 and the actual estimate to calculate test statistics that give us the p-values. 00:00:25.980 --> 00:00:32.460 In some scenarios calculating the standard error is hard, or calculating 00:00:32.460 --> 00:00:37.200 the standard error is something that requires assumptions that we are not willing to make, 00:00:37.200 --> 00:00:42.600 or assumptions that we know are not true for our particular data and analysis. 00:00:42.600 --> 00:00:47.940 Bootstrapping provides an alternative way of calculating standard errors, or estimating how much 00:00:47.940 --> 00:00:54.060 a statistic would vary from one sample to another. Bootstrapping is a computational approach 00:00:54.060 --> 00:00:59.760 to the problem of calculating a standard error. How bootstrapping works is that we have our 00:00:59.760 --> 00:01:04.590 original sample. So we have a sample of 10 observations here from a normally 00:01:04.590 --> 00:01:08.070 distributed population with a mean of 0 and a standard deviation of 1. 00:01:08.070 --> 00:01:14.640 That's our original sample here; its mean is 0.13. If we take 00:01:14.640 --> 00:01:20.880 multiple samples from the same population, here is the sampling distribution of 00:01:20.880 --> 00:01:25.350 the sample mean when the sample size is 10 from this population. 00:01:25.350 --> 00:01:30.930 Most of the time we get values close to 0, which is the population mean.
00:01:30.930 --> 00:01:36.540 And then sometimes we get estimates that are far 00:01:36.540 --> 00:01:41.340 from the actual population value. The idea of bootstrapping is that if 00:01:41.340 --> 00:01:49.770 we don't know how to estimate the width of this sampling distribution, or its 00:01:49.770 --> 00:01:55.350 shape, using statistical theory or a closed-form equation, then we can do it empirically. 00:01:55.350 --> 00:02:02.340 So instead of calculating it using an equation, we take repeated samples from our original sample. 00:02:02.340 --> 00:02:06.000 So our original sample forms the population for the bootstrap. 00:02:06.000 --> 00:02:15.870 Then we take a repeated sample: we first take 0.31, which is here, then we put it back, so 00:02:15.870 --> 00:02:20.760 we allow every observation to be included in the sample multiple times. 00:02:20.760 --> 00:02:27.390 Then we randomly take another one, 0.83, which is here, and we put it back; 00:02:27.390 --> 00:02:31.860 then we take yet another number, and yet another; we take 00:02:31.860 --> 00:02:37.590 -0.84 a second time. And so on. So we take these samples from the 00:02:37.590 --> 00:02:42.720 original data, and every observation can be included in the sample multiple times. 00:02:42.720 --> 00:02:49.260 Each of these randomly chosen numbers doesn't depend on any of the previous choices. 00:02:49.260 --> 00:02:56.910 Using this bootstrap sample we get 0.34 as the sample mean. We calculate it many 00:02:56.910 --> 00:03:01.470 times: typically one hundred, five hundred, a thousand, or even ten thousand times, 00:03:01.470 --> 00:03:08.040 depending on the complexity of the calculation. A thousand repetitions is quite normal nowadays.
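The resampling-with-replacement procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the data values are hypothetical stand-ins for the sample on the slide (only 0.31, 0.83, and -0.84 are actually mentioned).

```python
import random
import statistics

random.seed(42)  # for a reproducible sketch

# Hypothetical stand-in for the original sample of 10 observations
# drawn from a standard normal population.
sample = [0.13, 0.31, 0.83, -0.84, 1.20, -0.50, 0.05, -1.10, 0.40, 0.65]

def bootstrap_means(data, reps=1000):
    """Draw `reps` bootstrap samples (same size as the data, with
    replacement, so each observation can appear several times) and
    return the mean of each."""
    n = len(data)
    return [statistics.mean(random.choices(data, k=n)) for _ in range(reps)]

boot_means = bootstrap_means(sample)

# The bootstrap standard error of the mean is the standard deviation
# of the bootstrap replicates of the statistic.
boot_se = statistics.stdev(boot_means)
```

The key step is `random.choices(data, k=n)`: sampling with replacement, so each bootstrap sample has the same size as the original and observations can repeat or drop out.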
00:03:08.040 --> 00:03:14.820 So we can see that from sample to sample the sample mean varies. The distribution of 00:03:14.820 --> 00:03:21.060 the sample mean from the bootstrap samples, calculated 00:03:21.060 --> 00:03:28.800 from our thousand bootstrap replications here, is about the same shape as if we 00:03:28.800 --> 00:03:35.550 took the samples from the actual population. So these two distributions are quite similar, 00:03:35.550 --> 00:03:42.930 and we can use that information, the knowledge that these two distributions are similar and that they approach 00:03:42.930 --> 00:03:49.230 each other when the sample size increases, to say that this 00:03:49.230 --> 00:03:55.500 distribution here is a good representation of that distribution. And if we want to estimate the 00:03:55.500 --> 00:04:01.470 standard deviation of this distribution, which is what the standard error quantifies or estimates, 00:04:01.470 --> 00:04:07.380 then we can just use the standard deviation of that distribution. Here we can see that the mean 00:04:07.380 --> 00:04:11.220 of this distribution is slightly off. That's called the bootstrap bias. 00:04:11.220 --> 00:04:18.420 So this mean here is roughly at the mean here. It's not at the population mean; instead 00:04:18.420 --> 00:04:21.360 it's closer to the mean of this particular sample. 00:04:21.360 --> 00:04:27.850 Then also the width of this distribution is in this case slightly smaller, 00:04:27.850 --> 00:04:31.420 so the dispersion here is slightly smaller than the dispersion here. 00:04:31.420 --> 00:04:37.870 And that is also something that we sometimes need to take into consideration.
00:04:37.870 --> 00:04:43.360 The key thing in bootstrapping is that when the sample size increases, this mean and 00:04:43.360 --> 00:04:47.230 this standard deviation will be closer to that mean and that standard deviation. 00:04:47.800 --> 00:04:53.410 Let's take a look at a demonstration of how bootstrapping works. This is 00:04:53.410 --> 00:04:59.080 a video from the Department of Statistics at the University of Auckland. 00:04:59.080 --> 00:05:05.320 They demonstrate that you have your original sample here, so we have two variables. 00:05:05.320 --> 00:05:13.270 We have an X variable and a Y variable, and then we have a regression coefficient. 00:05:13.270 --> 00:05:19.120 So we calculate the regression coefficient here, and we are interested in how much this 00:05:19.120 --> 00:05:24.850 regression coefficient, the slope, would vary if we were to take this sample 00:05:24.850 --> 00:05:28.750 over and over from the same population. That's what the standard error quantifies. 00:05:28.750 --> 00:05:34.300 For some reason we don't want to use the normal formula that our statistical software 00:05:34.300 --> 00:05:37.810 uses to calculate the standard error. We want to do it by bootstrapping. 00:05:37.810 --> 00:05:44.620 So we take some samples from our original data. 00:05:44.620 --> 00:05:54.040 You can see here that each observation can be included multiple times. 00:05:54.040 --> 00:05:59.350 Sometimes an observation is not included in the sample. Then we get a regression coefficient that 00:05:59.350 --> 00:06:04.420 is slightly different from the original one. We do another bootstrap sample, 00:06:04.420 --> 00:06:08.350 we get another regression coefficient, again slightly different from the original one.
00:06:08.350 --> 00:06:19.570 We take yet another bootstrap sample, we get a slightly different one, and we go 00:06:19.570 --> 00:06:25.570 on a hundred times, a thousand times, and ultimately we get an estimate of how much 00:06:25.570 --> 00:06:30.880 this regression coefficient would really vary if we were to take multiple different samples. 00:06:41.880 --> 00:06:46.380 So that's what you get with a thousand samples or a hundred samples. 00:06:46.380 --> 00:06:53.070 Then you can see that the variance of the regression coefficient is that much between 00:06:53.070 --> 00:06:58.500 the bootstrap samples, and if the sample size is large enough this variation over the bootstrap 00:06:58.500 --> 00:07:04.650 samples is a good approximation of how much the regression coefficient would vary if we were to 00:07:04.650 --> 00:07:10.290 take repeated independent samples from the same population and calculate the regression analysis 00:07:10.290 --> 00:07:16.260 again and again from those independent samples. Bootstrapping can be used to calculate the 00:07:16.260 --> 00:07:21.450 standard error, in which case we just take the standard deviation of these regression slopes and 00:07:21.450 --> 00:07:25.770 then that is our standard error estimate. We can also use bootstrapping to 00:07:25.770 --> 00:07:29.700 calculate confidence intervals. The idea of a confidence interval is 00:07:29.700 --> 00:07:36.210 that instead of estimating a standard error and a p-value for a point estimate, 00:07:36.210 --> 00:07:43.740 for example a single value of a correlation, we estimate an interval, let's 00:07:43.740 --> 00:07:50.640 say a 95% interval, which has an upper limit and a lower limit. Then if we repeat the calculation 00:07:50.640 --> 00:07:57.120 many times from independent samples, the population value will be within the interval, 00:07:57.120 --> 00:08:03.630 if it's a valid interval, 95% of the time.
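The case-resampling bootstrap for a regression slope, as shown in the demonstration video, can be sketched like this. This is an illustrative stand-in, not the demonstration's own code, and the data are simulated with an assumed true slope of about 2.

```python
import random
import statistics

random.seed(0)  # reproducible sketch

def slope(pairs):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    mx = statistics.mean(x for x, _ in pairs)
    my = statistics.mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# Hypothetical data: y = 2x + noise, so the true slope is about 2.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]

# Case resampling: each bootstrap sample draws whole (x, y) pairs with
# replacement, then the regression is refit on that sample.
boot_slopes = [slope(random.choices(data, k=len(data)))
               for _ in range(1000)]

# Bootstrap estimate of the slope's standard error.
se_slope = statistics.stdev(boot_slopes)
```

Note that whole observations (x, y pairs) are resampled together, which matches the video: a point can appear several times in one bootstrap sample or be left out entirely.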
So this is an example of correlation. We 00:08:03.630 --> 00:08:08.430 can see that when there is zero correlation in the population 00:08:08.430 --> 00:08:13.110 and we have a small sample size, the correlation estimates vary between -0.2 and +0.2. 00:08:13.110 --> 00:08:20.340 Most of the time, when we draw the confidence interval, which is the line here, the 00:08:20.340 --> 00:08:24.660 line includes the population value. For two and a half percent of the 00:08:24.660 --> 00:08:31.020 replications, here, it doesn't include the population value. So the population 00:08:31.020 --> 00:08:37.650 value here falls above the upper limit. Here we have extremely large correlations, and the 00:08:37.650 --> 00:08:44.970 population value for about two and a half percent of the replications falls below the lower limit. 00:08:44.970 --> 00:08:50.910 In 95% of the cases here the population value is within the interval. 00:08:50.910 --> 00:08:56.820 So that's the idea of confidence intervals. Here we can see that when the population value is 00:08:56.820 --> 00:09:04.350 large, the width of the confidence interval depends on the correlation estimate. So when the 00:09:04.350 --> 00:09:09.720 correlation estimate is very high, the confidence interval is narrow. When 00:09:09.720 --> 00:09:15.570 the correlation estimate is very low, the confidence interval is a lot wider. 00:09:15.570 --> 00:09:20.400 So the confidence interval depends on the value of the statistic, and it also depends 00:09:20.400 --> 00:09:27.630 on the estimated standard error of the statistic. Now, there are a couple of ways that bootstrapping 00:09:27.630 --> 00:09:31.020 can be used for calculating confidence intervals. 00:09:31.020 --> 00:09:34.440 Normally when we do confidence intervals we use the normal approximation.
00:09:34.440 --> 00:09:38.430 So the idea is that we assume that the estimate is 00:09:38.430 --> 00:09:44.070 normally distributed over repeated samples. Then we calculate the confidence interval 00:09:44.070 --> 00:09:51.690 as our estimate plus or minus 1.96, which covers 95% of the normal distribution, 00:09:51.690 --> 00:09:57.750 multiplied by the standard error. So that gives us the plus or minus. 00:09:57.750 --> 00:10:04.320 So if we have an estimate of correlation that is here, then we multiply the standard 00:10:04.320 --> 00:10:11.910 error by 1.96; the estimate minus that is the lower limit, and the estimate plus 00:10:11.910 --> 00:10:17.580 1.96 times the standard error is here. So that gives us the upper and lower limits, in this 00:10:17.580 --> 00:10:23.400 example about 1 percent and 13 percent, when the actual estimate is about 5 percent. 00:10:23.400 --> 00:10:30.150 How we use bootstrapping for this calculation is that 00:10:30.150 --> 00:10:33.630 the standard error is simply the standard deviation of the bootstrap estimates. 00:10:33.630 --> 00:10:39.090 So if we take a correlation and bootstrap it, then we calculate how much the 00:10:39.090 --> 00:10:44.520 correlation varies between the bootstrap samples using the standard deviation metric, 00:10:44.520 --> 00:10:50.820 and then we plug that in. That formula gives us the confidence intervals. 00:10:50.820 --> 00:10:56.940 So that works when we can assume that the estimate is normally distributed. 00:10:56.940 --> 00:11:00.480 What if we can't assume that the estimate is normally distributed? 00:11:00.480 --> 00:11:05.940 That is the case when we can use empirical confidence intervals based on bootstrapping. 00:11:05.940 --> 00:11:12.780 So the idea of the normal approximation interval is that the estimate is normally distributed; 00:11:12.780 --> 00:11:17.820 then we can use this equation. Or we can use empirical confidence intervals.
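The normal-approximation interval just described (estimate ± 1.96 × bootstrap SE) can be sketched as follows; the sample here is simulated for illustration, and the mean stands in for whatever statistic is being bootstrapped.

```python
import random
import statistics

random.seed(1)  # reproducible sketch

# Hypothetical sample; the mean stands in for any statistic.
sample = [random.gauss(0, 1) for _ in range(50)]
estimate = statistics.mean(sample)

# Bootstrap standard error: the standard deviation of the replicates.
reps = [statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(1000)]
se = statistics.stdev(reps)

# Normal-approximation 95% interval: estimate +/- 1.96 * SE,
# where 1.96 covers 95% of the standard normal distribution.
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
```

This only replaces the standard error in the usual formula with its bootstrap estimate; the normality assumption about the estimate itself remains.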
00:11:17.820 --> 00:11:24.570 The idea of an empirical confidence interval is that we do the bootstrapping and, let's 00:11:24.570 --> 00:11:31.560 say, we take a thousand bootstrap replications. Then, ordering them from smallest to largest, 00:11:31.560 --> 00:11:37.350 we take the 25th value of the bootstrap replicates and that is 00:11:37.350 --> 00:11:45.180 our lower limit for the confidence interval. Then we take the 975th and that is the 00:11:45.180 --> 00:11:49.980 upper limit. So that's the 2.5 percent point, and the 97.5 percent point is the 00:11:49.980 --> 00:11:54.570 upper limit of our confidence interval. These are called percentile intervals. 00:11:54.570 --> 00:12:01.170 So when we have this kind of bootstrap distribution, we take the 00:12:01.170 --> 00:12:07.050 25th replication here, that is our lower limit, and we take the 975th 00:12:07.050 --> 00:12:12.060 replication here, that is our upper limit. So that gives us the confidence interval 00:12:12.060 --> 00:12:17.490 for the mean that is estimated here. This approach has two problems. 00:12:17.490 --> 00:12:25.410 First, the bootstrap distribution is biased. So the mean of these bootstrap replications 00:12:25.410 --> 00:12:31.260 is about 0.15, and the actual sample value for the mean is zero. 00:12:31.260 --> 00:12:38.490 To account for that bias we have bias-corrected confidence intervals. 00:12:38.490 --> 00:12:46.470 The idea of bias-corrected confidence intervals is that instead of taking the 25th and 975th 00:12:46.470 --> 00:12:52.470 bootstrap replicates as the endpoints, we first estimate how much the bootstrap bias 00:12:52.470 --> 00:13:00.180 is, and then based on that estimate we take, for example, the 40th and 988th replications.
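The percentile interval described above amounts to sorting the replicates and reading off two of them. A minimal sketch, again with simulated data and the mean as the statistic:

```python
import random
import statistics

random.seed(2)  # reproducible sketch

sample = [random.gauss(0, 1) for _ in range(50)]

# 1000 bootstrap replicates of the mean, sorted from smallest to largest.
reps = sorted(statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(1000))

# Percentile interval: counting from 1 as in the lecture, the 25th
# value is the 2.5% limit and the 975th value is the 97.5% limit
# (indices 24 and 974 with Python's 0-based indexing).
lower, upper = reps[24], reps[974]
```

No normality assumption is needed here; the interval's shape comes directly from the empirical bootstrap distribution.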
00:13:00.180 --> 00:13:06.660 So instead of taking the fixed 25th and the fixed 975th, we adjust which 00:13:06.660 --> 00:13:13.980 replicates we take as the endpoints. There's also the problem that the variance, the 00:13:13.980 --> 00:13:19.020 standard deviation, here is not necessarily the same as the standard deviation here. 00:13:19.020 --> 00:13:26.760 So in the correlation example you saw that the confidence interval narrowed as the 00:13:26.760 --> 00:13:34.080 actual correlation estimate went up. So the idea is that the width of the 00:13:34.080 --> 00:13:40.860 interval depends on the value of the estimate. To take that into account we have 00:13:40.860 --> 00:13:47.220 bias-corrected and accelerated confidence intervals, which apply the same idea as the 00:13:47.220 --> 00:13:53.280 bias-corrected ones, but instead of just taking the bias into account, they take the 00:13:53.280 --> 00:13:59.400 estimated differences in variance of these two distributions into account when we choose the 00:13:59.400 --> 00:14:07.320 endpoints for the confidence intervals. Now, the question is, this all looks really 00:14:07.320 --> 00:14:12.090 good, so can we estimate the variance of any statistic 00:14:12.090 --> 00:14:17.640 empirically without having to know the math? And yes, that's basically true, with 00:14:17.640 --> 00:14:24.960 some qualifications. The qualifications are that bootstrapping requires a large sample size. 00:14:24.960 --> 00:14:31.650 There is a good article, or a book chapter, by Koopman and co-authors in the book edited by 00:14:31.650 --> 00:14:37.560 Vandenberg about statistical myths and urban legends, and they point out that there are 00:14:37.560 --> 00:14:43.620 three different claims made in the literature.
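The endpoint-shifting idea behind bias-corrected and accelerated (BCa) intervals can be sketched with the standard library. This is a simplified illustration of the standard BCa construction (bias correction from the bootstrap distribution, acceleration from the jackknife), not code from the lecture or any particular package, and the data are simulated:

```python
import random
import statistics
from statistics import NormalDist

random.seed(3)  # reproducible sketch
nd = NormalDist()  # standard normal, for Phi and its inverse

sample = [random.gauss(0, 1) for _ in range(30)]
theta_hat = statistics.mean(sample)

B = 2000
reps = sorted(statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(B))

# Bias correction z0: how far the original estimate sits from the
# median of the bootstrap distribution.
z0 = nd.inv_cdf(sum(r < theta_hat for r in reps) / B)

# Acceleration a, from jackknife (leave-one-out) estimates: captures
# how the statistic's variability changes with its value.
jack = [statistics.mean(sample[:i] + sample[i + 1:])
        for i in range(len(sample))]
jm = statistics.mean(jack)
a = (sum((jm - j) ** 3 for j in jack)
     / (6 * sum((jm - j) ** 2 for j in jack) ** 1.5))

def bca_endpoint(alpha):
    """Shift the nominal percentile by z0 and a, then pick that replicate."""
    z = nd.inv_cdf(alpha)
    adj = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    return reps[min(B - 1, max(0, round(adj * B) - 1))]

# Instead of the fixed 2.5th/97.5th percentile replicates,
# BCa moves the endpoints according to z0 and a.
lower, upper = bca_endpoint(0.025), bca_endpoint(0.975)
```

With z0 = 0 and a = 0 this reduces exactly to the percentile interval; nonzero values shift which replicates serve as the endpoints, which is the adjustment described in the lecture.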
There's the claim that bootstrapping works 00:14:43.620 --> 00:14:49.620 well in small samples, and there is the fact that bootstrapping assumes that the 00:14:49.620 --> 00:14:54.810 sample is representative of the population. So if our sample is very different from the 00:14:54.810 --> 00:14:59.940 population, then the bootstrap samples that we take from our original sample 00:14:59.940 --> 00:15:04.980 cannot approximate how samples would actually behave from the real population. 00:15:04.980 --> 00:15:10.740 Then, sampling error, which means how different the sample is from the population, 00:15:10.740 --> 00:15:15.990 is troublesome in small samples. So in small samples the sample may not be 00:15:15.990 --> 00:15:23.400 a very accurate representation of the population. So if small samples are not representative of the 00:15:23.400 --> 00:15:31.290 population, and if we require that the sample must be representative of the population, then bootstrapping 00:15:31.290 --> 00:15:34.200 cannot work in small samples. So bootstrapping generally 00:15:34.200 --> 00:15:37.690 requires a large sample size. Then there are also some boundary conditions 00:15:37.690 --> 00:15:43.150 under which bootstrapping doesn't work even if you have a large sample. So there are those kinds of 00:15:43.150 --> 00:15:49.570 scenarios, but for most practical applications the sample size is the only thing that you need to 00:15:49.570 --> 00:15:55.480 be concerned about. The problem is that it is very hard to say when your sample size is large enough.