WEBVTT
Kind: captions
Language: en
00:00:00.060 --> 00:00:05.520
Regression analysis assumes that the sample
that you're analyzing is a random sample
00:00:05.520 --> 00:00:11.820
from the population. That could be violated for
example if you have 100 observations but those
00:00:11.820 --> 00:00:18.450
observations are measured from only five different
people, each of whom is measured 20 times.
00:00:18.450 --> 00:00:25.320
What is the impact of non-independence
of observations on regression analysis and what
00:00:25.320 --> 00:00:29.550
kind of problems could that cause for
empirical analysis? Let's take a look.
00:00:29.550 --> 00:00:34.290
So here are the six regression
assumptions according to Wooldridge
00:00:34.290 --> 00:00:40.950
and the second assumption is the independence
of observations. So what will happen if the
00:00:40.950 --> 00:00:46.470
observations are not independent? I will
go through this with a couple of examples.
00:00:46.470 --> 00:00:50.190
Let's take a simple example where
we are interested in estimating
00:00:50.190 --> 00:00:56.340
the mean of the population. So our
sample is 100 observations here and
00:00:56.340 --> 00:01:02.190
these 100 observations come from
five clusters. So let's say we are
00:01:02.190 --> 00:01:08.340
observing five companies over 20 years
or we are measuring reaction times from
00:01:08.340 --> 00:01:13.620
five people each measured 20 times and we
want to know what is the population mean.
00:01:13.620 --> 00:01:20.100
If the intraclass correlation is zero,
or there is no dependence between
00:01:20.100 --> 00:01:26.070
the observations within a cluster, we get a
very precise estimate of 0.08 for the mean.
00:01:26.070 --> 00:01:31.860
The actual population mean here is zero
and the population variance is 1.
00:01:31.860 --> 00:01:36.300
The intraclass correlation is 0.
00:01:36.300 --> 00:01:41.970
What will happen if we increase the intraclass
correlation? So we take these observations
00:01:41.970 --> 00:01:49.140
that are yellow, green and purple and make
them closer to one another. So let's start
00:01:49.140 --> 00:01:54.780
clustering the data. We can see that these
yellow observations start to cluster here
00:01:54.780 --> 00:01:59.790
these purple observations start to go here and
green observations go somewhere in the middle.
00:01:59.790 --> 00:02:07.650
When we increase the intraclass correlation of
these data, while maintaining the variance
00:02:07.650 --> 00:02:12.690
of the data, we can see that the estimate of
the mean, the sample mean, becomes a less and
00:02:12.690 --> 00:02:21.870
less accurate estimator of the population mean.
00:02:21.870 --> 00:02:29.580
Originally, when we had 100 independent
observations, our estimate was 0.08, and after
00:02:29.580 --> 00:02:39.630
we have strongly clustered the data it is 0.61.
When the intraclass correlation is 1 we have a
00:02:39.630 --> 00:02:48.120
special case where there is no within-cluster
variance. So we have 100 observations but only
00:02:48.120 --> 00:02:54.930
5 unique values. So if we only have 5 unique
values then it doesn't make a difference if
00:02:54.930 --> 00:03:02.940
we have each of those 5 values 1000 times or
just once because we gain no new information
00:03:02.940 --> 00:03:07.800
about where the population mean is because
after we have the first observation from a
00:03:07.800 --> 00:03:13.530
cluster then the other observations will bring
no more new information into the analysis.
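The diminishing-information effect described here can be checked with a short simulation. This is an illustrative sketch, not from the lecture: 100 observations in 5 clusters of 20, total variance fixed at 1 and population mean 0, showing how the sampling variability of the mean grows with the intraclass correlation.

```python
import numpy as np

# Illustrative sketch: the intraclass correlation (ICC) is the share
# of total variance that comes from the shared cluster means.
rng = np.random.default_rng(0)

def sample_mean(icc, n_clusters=5, per_cluster=20):
    cluster_effects = rng.normal(0.0, np.sqrt(icc), n_clusters)
    noise = rng.normal(0.0, np.sqrt(1.0 - icc), (n_clusters, per_cluster))
    return (cluster_effects[:, None] + noise).mean()

results = {}
for icc in (0.0, 0.5, 1.0):
    means = [sample_mean(icc) for _ in range(2000)]
    results[icc] = float(np.std(means))
    print(f"ICC={icc}: sd of the sample mean is about {results[icc]:.3f}")
```

With independent data the standard deviation of the mean is about 1/√100 = 0.1; as the ICC approaches 1 it approaches 1/√5 ≈ 0.45, the precision of only five observations.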
00:03:13.530 --> 00:03:21.030
So the idea here is that when our data are
independent then each observation brings the same
00:03:21.030 --> 00:03:27.240
amount of new information to the analysis. When
the observations are dependent, that is, there is intraclass
00:03:27.240 --> 00:03:33.000
correlation then the first observation from a
cluster brings lots of new unique information
00:03:33.000 --> 00:03:41.400
to the analysis, but once we have the first
observation, then the other observations
00:03:41.400 --> 00:03:46.800
from that same cluster give us less and
less information about where the population mean is.
00:03:46.800 --> 00:03:53.280
For example, if we want to measure the
average height of people at a university,
00:03:53.280 --> 00:03:59.760
and we have a measuring tape that
contains some measurement error, then
00:03:59.760 --> 00:04:06.390
it's better to measure 100 people than to
measure the same 10 people 10 times. And
00:04:06.390 --> 00:04:13.470
of course if you have no measurement error
then measuring the same 10 people or same
00:04:13.470 --> 00:04:18.570
5 people over and over and over will not
improve the precision of your estimate.
00:04:18.570 --> 00:04:25.110
So the problem here is that when
intraclass correlation increases, when
00:04:25.110 --> 00:04:30.390
there is a lack of independence, then
our estimates will be less precise. They
00:04:30.390 --> 00:04:34.020
are still consistent. They are still
unbiased but they are less precise.
00:04:34.020 --> 00:04:39.150
Okay so that's one variable what if we
have two variables and we want to run a
00:04:39.150 --> 00:04:44.700
regression analysis? So we have X and we
have Y. We still have 100 observations
00:04:44.700 --> 00:04:49.110
nested in five clusters so we have
20 observations for each cluster.
00:04:49.110 --> 00:04:53.610
Initially the intraclass correlation is 0, so
all these observations are independent.
00:04:53.610 --> 00:05:00.510
There is no particular pattern in the colors
and the regression estimates are quite precise.
00:05:00.510 --> 00:05:06.810
The intercept is actually zero.
Our estimate is 0.1. The actual slope
00:05:06.810 --> 00:05:13.140
is 1. Our estimate is 1.07. So it's
pretty close. That's something that
00:05:13.140 --> 00:05:18.330
you can expect from 100 observations with one
explanatory variable in a regression analysis.
00:05:18.330 --> 00:05:24.510
When we increase the intraclass correlation of
both these variables we can see again that there's
00:05:24.510 --> 00:05:30.630
some clustering. So these yellow observations
go here and these purple observations go here.
00:05:30.630 --> 00:05:37.200
Green observations go here, and ultimately, when
the intraclass correlation is one, we are in
00:05:37.200 --> 00:05:44.220
a scenario where we have just 5 unique observations
that are repeated. And again, if in the same data
00:05:44.220 --> 00:05:49.740
set we just repeat the observations, that gives
us no new information for the estimation problem.
00:05:49.740 --> 00:05:57.360
The outcome is that when both of these
variables have clustering effects then our
00:05:57.360 --> 00:06:03.180
regression coefficients, both the intercept
and the slope, will be less and less precise.
00:06:03.180 --> 00:06:08.640
They are still consistent and they are still
unbiased, but the effect is the same as it was
00:06:08.640 --> 00:06:15.150
in the case where we
estimated the mean from clustered data.
00:06:15.150 --> 00:06:24.720
So in effect intraclass correlation decreases
our effective sample size. So if we have 100
00:06:24.720 --> 00:06:30.750
observations that are strongly clustered it's
possible that we actually have only 5 observations
00:06:30.750 --> 00:06:37.980
worth of information. In less extreme cases we
could have something like 100 observations but
00:06:37.980 --> 00:06:43.620
they actually give us information that is
only worth about 20 observations and so on.
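The "worth of information" idea stated here matches the standard design-effect formula for equal-sized clusters. The lecture does not give the formula explicitly; this is a sketch of that standard result, where m is the cluster size:

```python
def effective_n(n, cluster_size, icc):
    # Standard design-effect formula for equal-sized clusters:
    # n_eff = n / (1 + (m - 1) * icc), where m is the cluster size.
    return n / (1 + (cluster_size - 1) * icc)

print(effective_n(100, 20, 0.0))  # 100.0: independent data, full information
print(effective_n(100, 20, 1.0))  # 5.0: only the 5 cluster means carry information
print(effective_n(100, 20, 4 / 19))  # 20.0: about "20 observations worth"
```

An ICC of roughly 0.21 is already enough to shrink 100 clustered observations down to the information content of about 20 independent ones.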
00:06:43.620 --> 00:06:51.000
Things get more interesting if we only
have X that is clustered or we only have
00:06:51.000 --> 00:06:55.800
the error term that is clustered, but not
both. So let's first take a look at
00:06:55.800 --> 00:07:01.620
what happens when our X is clustered
but the error terms are independent.
00:07:01.620 --> 00:07:06.930
So we can see that as the intraclass correlation
again increases, the data become more and more
00:07:06.930 --> 00:07:14.370
clustered until we have just five unique values. In
this case, when X is clustered but the error term
00:07:14.370 --> 00:07:19.920
is not, the clustering actually doesn't
have an effect. So we can see that the
00:07:19.920 --> 00:07:24.150
intercept and the
slope are going to be slightly different
00:07:24.150 --> 00:07:28.890
when the clustering changes but that's just
because when you estimate the same quantity
00:07:28.890 --> 00:07:33.540
from different samples you will get different
results. So there is no systematic effect in
00:07:33.540 --> 00:07:37.560
the estimates getting worse and worse
when intraclass correlation increases.
00:07:37.560 --> 00:07:44.010
The reason for this is that regression analysis
actually doesn't make any assumptions about the
00:07:44.010 --> 00:07:47.610
independent variable.
00:07:47.610 --> 00:07:52.470
Everything is estimated conditional
on the observed values. So we could have
00:07:52.470 --> 00:07:57.690
a researcher that sets these X values.
For example in an experimental context
00:07:57.690 --> 00:08:03.120
we actually assign people to the
treatment group and to the control
00:08:03.120 --> 00:08:07.020
group, so those are not random variables; they are
something that we set as researchers, and we
00:08:07.020 --> 00:08:11.760
could of course set them however we want and
regression analysis would not be affected.
00:08:11.760 --> 00:08:18.660
What if our X is random, that is,
X is not clustered, but the error
00:08:18.660 --> 00:08:22.920
term is clustered? This is something
that would be quite an unusual case
00:08:22.920 --> 00:08:25.680
but it's nevertheless useful
to understand what happens.
00:08:25.680 --> 00:08:34.320
So when we cluster the error term we effectively
reduce the variation, or the number of unique values, in
00:08:34.320 --> 00:08:41.610
the error term, and that has one implication. The
implication is that this intercept is going to
00:08:41.610 --> 00:08:47.190
be estimated less precisely but the slope
estimate is going to stay about the same.
00:08:47.190 --> 00:08:56.250
One way to understand why that is the
case is that these error term values here,
00:08:56.250 --> 00:09:01.410
even if we have just one value for each
cluster, they will still give us very
00:09:01.410 --> 00:09:08.760
useful information about the direction of
the line, but not about how high the line is.
00:09:08.760 --> 00:09:14.910
As you can see, when the errors are
exactly the same within each cluster, the intraclass
00:09:14.910 --> 00:09:20.220
correlation is 1, and then all of these
clusters form an exact line that
00:09:20.220 --> 00:09:26.100
is parallel to the population regression
line here, but the intercept is estimated
00:09:26.100 --> 00:09:31.020
less efficiently. So this would of
course be a very unusual scenario.
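This asymmetry, with the intercept suffering but the slope surviving, can be illustrated with a small simulation. It is a hypothetical sketch, not the lecture's own data: X is drawn independently while the error term shares a cluster-level component, and we compare the spread of intercept and slope estimates across replications.

```python
import numpy as np

# Hypothetical sketch: X independent, error term clustered.
# 100 observations in 5 clusters of 20; true intercept 0, true slope 1.
rng = np.random.default_rng(42)
g = np.repeat(np.arange(5), 20)  # cluster labels

def fit_once(icc):
    x = rng.normal(0.0, 1.0, 100)
    # Error variance fixed at 1, split between cluster and noise parts.
    u = rng.normal(0.0, np.sqrt(icc), 5)[g] + rng.normal(0.0, np.sqrt(1 - icc), 100)
    y = 1.0 * x + u
    X = np.column_stack([np.ones(100), x])
    return np.linalg.solve(X.T @ X, X.T @ y)  # (intercept, slope) OLS estimates

def spread(icc, reps=2000):
    est = np.array([fit_once(icc) for _ in range(reps)])
    return est.std(axis=0)  # sd of intercept and slope estimates

sd0, sd9 = spread(0.0), spread(0.9)
print("intercept sd:", sd0[0], "->", sd9[0])  # grows markedly with clustering
print("slope sd:    ", sd0[1], "->", sd9[1])  # stays about the same
```

Because the clustered errors are independent of X, they shift whole clusters up or down (hurting the intercept) without tilting the fitted line.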
00:09:31.020 --> 00:09:40.410
Typically, if you cannot assume that your error
term, the unobserved sources of variation in
00:09:40.410 --> 00:09:47.100
the dependent variable, is independent, then
typically your explanatory variables
00:09:47.100 --> 00:09:52.620
cannot be assumed to be independent either.
So we either have the case where
00:09:52.620 --> 00:09:57.630
the error term is independent, which
would be the case in random sampling,
00:09:57.630 --> 00:10:06.600
while X could be non-independent, for
example due to manipulation, or we
00:10:06.600 --> 00:10:09.930
have the scenario where both of these
variables correlate within clusters.
00:10:09.930 --> 00:10:15.480
So why would this be a problem? Why is
non-independence of observations a problem
00:10:15.480 --> 00:10:23.610
and what does it cause? As we saw, non-
00:10:23.610 --> 00:10:29.040
independence of observations doesn't lead to
bias. It doesn't lead to inconsistency, but it
00:10:29.040 --> 00:10:35.010
leads to less precise estimates, and that
is something that we just can't do anything
00:10:35.010 --> 00:10:39.450
about. If we don't have much information
then we can't estimate things precisely.
00:10:39.450 --> 00:10:47.250
But that's not really
a problem, because we can just state that, well,
00:10:47.250 --> 00:10:52.230
we have an estimate but it's not very precise
and sometimes we have to just live with that.
00:10:52.230 --> 00:10:58.650
The real problem is that if we look at the
standard error formula, which is derived
00:10:58.650 --> 00:11:05.130
based on this variance formula, where we
just plug in the
00:11:05.130 --> 00:11:09.720
estimated variance of the error term
for the sigma, and the sum of squares
00:11:09.720 --> 00:11:16.650
total. This equation only depends
on the variance of the error term, it
00:11:16.650 --> 00:11:21.450
depends on the variance of the predictor
variable, and it depends on the sample size.
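The variance formula described here is the usual OLS slope-variance expression; written out, with sigma squared the error variance and SST_x the total sum of squares of the predictor:

```latex
\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2}{\mathrm{SST}_x},
\qquad
\mathrm{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\widehat{\operatorname{se}}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{\mathrm{SST}_x}}
```

Since SST_x is roughly (n - 1) times the predictor variance, the formula depends only on the error variance, the predictor variance, and the sample size, exactly as stated, and nothing in it registers clustering.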
00:11:21.450 --> 00:11:29.730
If we have a clustering effect in the data, we
saw that estimates will be less precise even
00:11:29.730 --> 00:11:34.920
if the variance of the error term and the variance
of the predictor and the sample size are the
00:11:34.920 --> 00:11:41.670
same. And this equation doesn't take the
clustering into account. So regardless of
00:11:41.670 --> 00:11:47.010
whether we have five observations that are
each replicated 20 times in our data,
00:11:47.010 --> 00:11:55.470
so our sample size appears to be 100 but is
effectively much smaller, or whether we
00:11:55.470 --> 00:11:59.640
actually have 100 unique observations,
this formula gives us the same result.
00:11:59.640 --> 00:12:08.040
And the outcome is that when you have clustering
then the standard errors are generally estimated
00:12:08.040 --> 00:12:14.400
inconsistently and they will be negatively
biased. So you will overstate the precision
00:12:14.400 --> 00:12:21.390
of the estimates and that will cause incorrect
inference and particularly it can lead to false
00:12:21.390 --> 00:12:27.630
positive findings, rejecting the null hypothesis
when in fact it should not be rejected.
00:12:27.630 --> 00:12:34.920
So what can we do about this problem? There
are a couple of strategies. One is that we
00:12:34.920 --> 00:12:42.960
use a model that specifically includes terms
that model the non-independence
00:12:42.960 --> 00:12:50.250
of the error term, which can be quite
difficult to do if the pattern of dependency
00:12:50.250 --> 00:12:56.190
between observations is complex. Another
approach is that we use cluster robust standard
00:12:56.190 --> 00:13:02.220
errors which will allow you to take an arbitrary
correlation structure between observations into
00:13:02.220 --> 00:13:07.770
account and that is a very general strategy
and I will explain that in another video.
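As a preview of that strategy, here is a minimal NumPy sketch of cluster-robust ("sandwich") standard errors on simulated data. The data, seed, and variable names are hypothetical, and in practice one would use a library implementation (for example, statsmodels supports cov_type='cluster' in OLS fitting).

```python
import numpy as np

# Minimal sketch of cluster-robust (sandwich) standard errors.
# Hypothetical data: 100 obs in 5 clusters; both x and the error term
# contain cluster-level components, drawn independently of each other,
# so OLS stays unbiased but the conventional SEs ignore the clustering.
rng = np.random.default_rng(7)
g = np.repeat(np.arange(5), 20)  # cluster labels
x = rng.normal(0, 1, 5)[g] + rng.normal(0, 1, 100)
u = rng.normal(0, 1, 5)[g] + rng.normal(0, 1, 100)
y = 1.0 * x + u

X = np.column_stack([np.ones(100), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Conventional SEs assume independent, homoskedastic errors.
sigma2 = resid @ resid / (len(y) - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust "meat": sum of score outer products by cluster,
# which allows an arbitrary correlation structure within each cluster.
meat = sum(np.outer(X[g == c].T @ resid[g == c],
                    X[g == c].T @ resid[g == c]) for c in np.unique(g))
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print("naive SEs:    ", se_naive)
print("clustered SEs:", se_cluster)
```

With only five clusters this estimator is itself noisy; applied work typically uses small-sample corrections on top of this basic sandwich form.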