WEBVTT
00:00:00.180 --> 00:00:05.850
In this video, I will show you one
possible workflow for regression analysis.
00:00:05.850 --> 00:00:09.180
This workflow will address
all the assumptions that are
00:00:09.180 --> 00:00:11.790
empirically testable after regression analysis.
00:00:11.790 --> 00:00:14.070
There are, of course, multiple different ways of
00:00:14.070 --> 00:00:17.190
testing assumptions. But this
is the way I like to do it.
00:00:17.190 --> 00:00:19.530
I'm using R for this example.
00:00:19.530 --> 00:00:24.630
But all of these tests and diagnostics
can be done with Stata as well.
00:00:24.630 --> 00:00:27.690
And most of them can be done with SPSS.
00:00:27.690 --> 00:00:33.780
Regression analysis workflow, like
any other statistical analysis
00:00:33.780 --> 00:00:37.530
workflow, starts by stating
a hypothesis that we want to test,
00:00:37.530 --> 00:00:41.190
then we collect some data
for testing the hypothesis.
00:00:41.190 --> 00:00:46.410
After that, we explore the data, because it is
important to understand the relationships.
00:00:46.410 --> 00:00:49.320
Then we estimate the first regression model,
00:00:49.320 --> 00:00:52.710
where we have the independent
variables and the dependent variable.
00:00:52.710 --> 00:00:57.150
Then we check the results
briefly, to see what they're like.
00:00:57.150 --> 00:00:59.100
And we proceed with diagnostics.
00:00:59.100 --> 00:01:06.180
So the diagnostics include various plots,
and I prefer plots over statistical tests.
00:01:06.180 --> 00:01:11.190
The reason is that while you can, for
example, run a test for heteroskedasticity,
00:01:11.190 --> 00:01:14.700
that test will only tell you
whether there's a problem or not,
00:01:14.700 --> 00:01:17.400
it will not tell you the nature of the problem.
00:01:17.400 --> 00:01:21.630
It is much more informative to look at the actual
00:01:21.630 --> 00:01:27.030
distribution of the residuals to see what
the heteroskedasticity problem is like.
00:01:27.030 --> 00:01:30.330
And also if you just look or eyeball these graphs,
00:01:30.330 --> 00:01:35.400
you will basically identify the same
thing that the test tells you.
00:01:35.400 --> 00:01:40.020
So I don't generally use tests
unless someone asks me to.
00:01:40.020 --> 00:01:44.310
Then when I have done the diagnostics, I
figure out what is the biggest problem.
00:01:44.310 --> 00:01:50.730
And once I have fixed the biggest problem,
then I go back and fit another regression model.
00:01:50.730 --> 00:01:53.670
For example, I may identify that there are some
00:01:53.670 --> 00:01:56.670
nonlinear relationships that
I didn't think of in advance,
00:01:56.670 --> 00:01:59.250
or I may identify some outliers,
00:01:59.250 --> 00:02:05.520
or I may identify some heteroskedasticity,
I go back to fit another regression model,
00:02:05.520 --> 00:02:08.970
where I have fixed the problem,
then I do diagnostics again.
00:02:08.970 --> 00:02:15.510
And once I'm happy, I conclude that
that is my final model. After the diagnostics,
00:02:15.510 --> 00:02:20.340
I possibly do nested model tests
against alternative models.
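A nested model test like this can be sketched in R; this is a minimal, self-contained example with simulated data and hypothetical variable names, since the actual data are introduced only later in the video:

```r
# Simulated stand-in data with hypothetical variables x1, x2, y
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

# A restricted model and a full model that nests it
m_small <- lm(y ~ x1, data = d)
m_full  <- lm(y ~ x1 + x2, data = d)

# F test of the nested models: does adding x2 improve the fit?
anova(m_small, m_full)
```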
00:02:20.340 --> 00:02:24.840
And then comes the fun part, I interpret
what the regression coefficients mean.
00:02:24.840 --> 00:02:28.770
So I don't just state that there
is some coefficient of 0.02.
00:02:28.770 --> 00:02:31.950
I tell what it means in my
particular research context.
00:02:31.950 --> 00:02:35.730
And that is the hard part in regression analysis.
00:02:35.730 --> 00:02:40.620
To demonstrate the regression
analysis diagnostics
00:02:40.620 --> 00:02:44.130
with some data, we are going to
be using the Prestige dataset again.
00:02:44.130 --> 00:02:48.390
And our dependent variable
is prestige this time,
00:02:48.390 --> 00:02:52.680
and we're going to use education, income,
and share of women as independent variables.
00:02:52.680 --> 00:02:54.600
So that is a regression model.
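As a sketch of this step in R: the real Prestige data ship with the car package, so simulated stand-in data with the same variable names are used here to keep the example self-contained; the coefficients in the simulation are made up:

```r
# Simulated stand-in for car::Prestige (education in years, income in dollars,
# women as a percentage, prestige as a score); values are invented
set.seed(1)
d <- data.frame(education = rnorm(100, mean = 11, sd = 3),
                income    = rlnorm(100, meanlog = 8.5, sdlog = 0.5),
                women     = runif(100, min = 0, max = 100))
d$prestige <- 10 + 3 * d$education + 0.001 * d$income + rnorm(100, sd = 6)

# Prestige regressed on education, income, and share of women
model <- lm(prestige ~ education + income + women, data = d)
summary(model)  # a brief look at the estimates before diagnostics
```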
00:02:54.600 --> 00:03:00.780
And the regression estimates are here,
we have gone through these estimates
00:03:00.780 --> 00:03:04.560
before in a previous video,
so I will not explain them in detail.
00:03:04.560 --> 00:03:09.690
Instead, I'm going to be focusing
now on the assumptions checking.
00:03:09.690 --> 00:03:12.960
So how do we know that the six
regression assumptions actually hold?
00:03:12.960 --> 00:03:15.780
The assumptions are shown here,
00:03:15.780 --> 00:03:20.400
the assumptions are that all
relationships are linear.
00:03:20.400 --> 00:03:23.370
So it's a linear model,
observations are independent.
00:03:23.370 --> 00:03:28.080
So independence of observation
comes from our research design.
00:03:28.080 --> 00:03:31.350
And in a cross-sectional study,
it is difficult to test.
00:03:31.350 --> 00:03:33.660
If you have a longitudinal study,
00:03:33.660 --> 00:03:36.960
then you can do some checks for
independence of observations.
00:03:36.960 --> 00:03:41.490
No perfect collinearity and non-zero
variances of independent variables.
00:03:41.490 --> 00:03:47.880
That happens if two or more variables
perfectly determine one another.
00:03:47.880 --> 00:03:50.970
So if you have a categorical
variable of three categories,
00:03:50.970 --> 00:03:54.750
then including three dummies
leads to this problem,
00:03:54.750 --> 00:03:57.750
because once you know two dummies,
you know the third value.
00:03:57.750 --> 00:04:02.070
Also non-zero variance,
if you have zero variance,
00:04:02.070 --> 00:04:04.560
for example,
if you are studying
00:04:04.560 --> 00:04:09.870
the effects of gender, and you have no women in
the sample, then you have no variance in gender.
00:04:09.870 --> 00:04:11.520
So that is another implication,
00:04:12.060 --> 00:04:14.010
another reason why this could occur.
00:04:14.010 --> 00:04:17.670
We know that this is not a problem in our data.
00:04:17.670 --> 00:04:19.560
Because if it was a problem,
00:04:19.560 --> 00:04:25.620
we couldn't even estimate the regression model.
Because we got regression estimates, we know
00:04:25.620 --> 00:04:27.700
that we don't have a problem
with the third assumption.
00:04:27.700 --> 00:04:32.260
The other assumptions are a bit more problematic,
00:04:32.260 --> 00:04:35.980
because they are about the error
term and we can't observe the error term.
00:04:35.980 --> 00:04:41.230
So the fourth assumption was that the error
term has an expected value of zero given
00:04:41.230 --> 00:04:45.100
any values of the independent variables; then the
error term has equal variance,
00:04:45.100 --> 00:04:48.880
which is the homoskedasticity assumption; and
then the error term is normally distributed.
00:04:48.880 --> 00:04:54.640
How we test these assumptions about the
error term, these three assumptions,
00:04:54.640 --> 00:04:59.110
is that we use the residuals
as estimates of the error term.
00:04:59.110 --> 00:05:03.880
So if an observation is far from the
regression line in the population,
00:05:03.880 --> 00:05:05.530
that is, it has a large value of the error term,
00:05:05.530 --> 00:05:10.510
then we can expect that it
also has a large residual.
00:05:10.510 --> 00:05:13.600
So we can use the residuals
as estimates of error terms.
00:05:13.600 --> 00:05:18.220
So normally, doing regression
diagnostics means analyzing the residuals.
00:05:18.220 --> 00:05:19.840
And that's quite natural.
00:05:19.840 --> 00:05:24.520
Because if you think of the residual as the part
of the data that the model doesn't explain,
00:05:24.520 --> 00:05:31.000
and the idea of diagnostics is that we check
if the model explains the data adequately,
00:05:31.000 --> 00:05:34.330
then it's quite natural to
look at the part of the data
00:05:34.330 --> 00:05:38.770
the model doesn't explain for
clues about what could go wrong.
00:05:38.770 --> 00:05:45.190
I normally start with the
normal Q-Q plot of the residuals.
00:05:45.190 --> 00:05:50.110
And the normal Q-Q plot is something that
00:05:50.110 --> 00:05:53.890
shows whether the regression
residuals are normally distributed.
00:05:53.890 --> 00:06:03.220
So it compares the residuals, or here these are
calculated based on standardized residuals;
00:06:03.220 --> 00:06:07.480
there are different kinds of residuals,
but for an applied researcher,
00:06:07.480 --> 00:06:10.000
it doesn't really matter to know them all.
00:06:10.000 --> 00:06:15.130
What's important is that your
software will calculate the
00:06:15.130 --> 00:06:18.070
right kind of residual for you automatically
00:06:18.910 --> 00:06:20.080
when you do these plots.
00:06:20.080 --> 00:06:25.600
Here we have normal distributions: we're
comparing residuals against the normal distribution,
00:06:25.600 --> 00:06:30.670
and we can see here that they
roughly correspond.
00:06:30.670 --> 00:06:34.210
So a straight line here indicates that
the residuals are normally distributed.
00:06:34.210 --> 00:06:38.410
Here's a problem case: we have
chi-square distributed residuals here.
00:06:38.410 --> 00:06:44.980
So the residuals here are further
from the mean than they're supposed to be.
00:06:44.980 --> 00:06:48.880
And here we have the inverse: we have a
uniform distribution of the error term.
00:06:48.880 --> 00:06:53.080
And that creates this kind of S
shape in the normal Q-Q plot.
00:06:53.080 --> 00:06:59.230
While the normality of the error term is not
an important assumption in regression analysis,
00:06:59.230 --> 00:07:05.200
I nevertheless do this because it is
usually quick to do, it identifies outliers
00:07:05.200 --> 00:07:08.560
for me, and it gives me a kind of
first look at the data.
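A sketch of this step in R (self-contained, so the model is refit on the simulated stand-in data; with the real data you would call plot() on the model fitted to car::Prestige):

```r
# Refit the model on simulated stand-in data
set.seed(1)
d <- data.frame(education = rnorm(100, 11, 3),
                income    = rlnorm(100, 8.5, 0.5),
                women     = runif(100, 0, 100))
d$prestige <- 10 + 3 * d$education + 0.001 * d$income + rnorm(100, sd = 6)
model <- lm(prestige ~ education + income + women, data = d)

plot(model, which = 2)   # normal Q-Q plot of the standardized residuals
r <- rstandard(model)    # the standardized residuals that R plots here
```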
00:07:08.560 --> 00:07:12.280
Here with the actual data,
00:07:12.280 --> 00:07:16.510
I can see that the residuals
follow normal distribution.
00:07:16.510 --> 00:07:18.190
So I'm happy with this;
00:07:18.190 --> 00:07:25.780
this is an indication of a good-fitting model,
if we think of the sixth assumption.
00:07:25.780 --> 00:07:29.680
R labels these possible outliers.
00:07:29.680 --> 00:07:32.530
So newsboys has a large negative residual.
00:07:32.530 --> 00:07:37.390
So newsboys is less prestigious
than what the model predicts.
00:07:37.390 --> 00:07:41.590
And farmers are more prestigious
than what the model predicts.
00:07:41.590 --> 00:07:43.810
So farmers don't make much money.
00:07:43.810 --> 00:07:46.540
And you don't need high education to be a farmer.
00:07:46.540 --> 00:07:48.670
But farmers are still appreciated a lot.
00:07:48.670 --> 00:07:52.840
So that's another extreme case.
00:07:52.840 --> 00:07:57.970
So the normal Q-Q plot shows that
the residuals are roughly
00:07:57.970 --> 00:08:00.040
normally distributed, and that's a good thing.
00:08:00.040 --> 00:08:06.370
So we conclude no problems,
then we start looking at more complicated plots.
00:08:06.370 --> 00:08:09.340
The next plot is the residual versus fitted plot.
00:08:09.340 --> 00:08:13.060
And the idea of the residual versus
fitted plot is that it allows us
00:08:13.060 --> 00:08:17.170
to check for nonlinearities and
heteroskedasticity in the data.
00:08:17.170 --> 00:08:21.670
So the fitted value is calculated
based on the regression equation.
00:08:21.670 --> 00:08:27.250
We multiply the variables
by the regression coefficients,
00:08:27.250 --> 00:08:30.280
and then we compare residual versus fitted.
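The step just described can be sketched in R (again refit on the simulated stand-in data so the block runs on its own); computing the fitted values by hand mirrors the "multiply the variables by the coefficients" description:

```r
# Refit the model on simulated stand-in data
set.seed(1)
d <- data.frame(education = rnorm(100, 11, 3),
                income    = rlnorm(100, 8.5, 0.5),
                women     = runif(100, 0, 100))
d$prestige <- 10 + 3 * d$education + 0.001 * d$income + rnorm(100, sd = 6)
model <- lm(prestige ~ education + income + women, data = d)

# Fitted values are the regression equation applied to each observation
X <- cbind(1, d$education, d$income, d$women)
fit_manual <- as.vector(X %*% coef(model))
stopifnot(all.equal(fit_manual, unname(fitted(model))))

plot(model, which = 1)   # residuals vs fitted: ideally no pattern
```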
00:08:30.280 --> 00:08:35.830
Ideally, there is no pattern here;
the residuals and fitted values
00:08:35.830 --> 00:08:37.930
are just spread out.
00:08:37.930 --> 00:08:41.260
So this is an indication of a well fitting model.
00:08:41.260 --> 00:08:44.560
Here, in contrast, we have
a heteroskedasticity problem.
00:08:44.560 --> 00:08:51.070
So that plot contains data where
the variation of the residual,
00:08:51.070 --> 00:08:55.780
and also of the error term,
is a lot less here in the middle,
00:08:55.780 --> 00:08:58.660
and then it opens up to the left and to the right.
00:08:58.660 --> 00:09:01.090
So this is a butterfly shape of residuals.
00:09:01.090 --> 00:09:05.440
And this is the worst kind of
heteroskedasticity problem that you could have.
00:09:05.440 --> 00:09:07.630
But it's not very, very realistic,
00:09:07.630 --> 00:09:12.790
because it's difficult to think of what kind
of process would generate this kind of data.
00:09:12.790 --> 00:09:19.390
Then here, we have a nonlinearity
and some heteroskedasticity problems.
00:09:19.390 --> 00:09:21.550
So this is a megaphone opening, right?
00:09:21.550 --> 00:09:25.270
And it appears that there's
slight nonlinearity here,
00:09:25.270 --> 00:09:27.610
we have here severe nonlinearity.
00:09:27.610 --> 00:09:32.890
So the right shape is not a line,
but a curve here.
00:09:32.890 --> 00:09:39.250
And this is a weird looking dataset
that has a nonlinearity problem.
00:09:39.250 --> 00:09:42.700
And also it has a heteroskedasticity problem.
00:09:42.700 --> 00:09:44.800
So in the plot,
00:09:44.800 --> 00:09:48.730
we want to have something that looks
like this: no particular pattern.
00:09:48.730 --> 00:09:54.460
So typically, in these diagnostic plots
that plot the residual against something else,
00:09:54.460 --> 00:09:56.320
you are looking for no pattern.
00:09:56.320 --> 00:10:00.430
Our residual versus fitted plot looks like that.
00:10:00.430 --> 00:10:07.120
So, we have marked again, these observations
with high residuals in absolute value.
00:10:07.120 --> 00:10:11.620
And then, looking at the fitted values,
00:10:11.620 --> 00:10:17.050
we can see that there are only a few professions for
which the model predicts high prestigiousness.
00:10:17.050 --> 00:10:21.670
And most observations are between 30 and 70.
00:10:21.670 --> 00:10:24.760
So what can we infer from this plot,
00:10:24.760 --> 00:10:31.480
we can infer that maybe the variance of the
residuals decreases slightly to the right.
00:10:31.480 --> 00:10:34.000
So we don't have many observations here.
00:10:34.000 --> 00:10:38.020
So we don't know if this is
actually the same dispersion
00:10:38.020 --> 00:10:41.200
here, and we just observe two
values from that dispersion.
00:10:41.200 --> 00:10:45.760
But if you look at
the dispersion here,
00:10:46.870 --> 00:10:51.760
and then look at the dispersion here,
it's slightly less,
00:10:51.760 --> 00:10:54.550
so it is possible that we have a
heteroskedasticity problem.
00:10:54.550 --> 00:10:59.830
So the fifth assumption does not
hold. Whether that is severe enough
00:10:59.830 --> 00:11:03.610
to warrant using
heteroskedasticity-robust standard errors,
00:11:03.610 --> 00:11:04.960
that is a bit unclear,
00:11:04.960 --> 00:11:10.420
because this is not a clear case
where we should use those.
00:11:10.420 --> 00:11:12.700
Then we check for outliers.
00:11:12.700 --> 00:11:19.360
So far, we have been looking for
evidence of heteroskedasticity and nonlinearity.
00:11:19.360 --> 00:11:23.560
We have found evidence for heteroskedasticity,
but not really for nonlinearities.
00:11:23.560 --> 00:11:28.150
Then we are looking for outliers as
the final step using the fourth plot.
00:11:28.150 --> 00:11:36.670
And the residual versus leverage plot
tells us which observations are influential.
00:11:36.670 --> 00:11:42.790
So we're looking here at observations that
have a high leverage and high residual.
00:11:42.790 --> 00:11:51.040
So we have general managers who have high
leverage and a high residual in absolute value.
00:11:51.040 --> 00:11:53.410
So we want to look for observations
00:11:53.410 --> 00:11:57.970
with residuals that are large in
absolute value, in absolute magnitude.
00:11:57.970 --> 00:12:06.670
Stata, for example, uses the squared
residual here, because that is always positive.
00:12:06.670 --> 00:12:10.750
So it's easier to see which
observations have large residuals;
00:12:10.750 --> 00:12:11.290
here,
00:12:11.290 --> 00:12:16.510
we have to look at both very negative values
and large positive values.
00:12:16.510 --> 00:12:21.370
So it's not as simple as it would be
if this was the square of the residual.
00:12:21.370 --> 00:12:27.400
So minister has high leverage,
newsboys has a large residual,
00:12:27.400 --> 00:12:32.200
and then general managers is here.
Cook's distance is another
00:12:32.200 --> 00:12:40.240
measure of influence, and observations with
large Cook's distance are potential outliers.
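These influence measures are all available in base R; a self-contained sketch on the simulated stand-in data:

```r
# Refit the model on simulated stand-in data
set.seed(1)
d <- data.frame(education = rnorm(100, 11, 3),
                income    = rlnorm(100, 8.5, 0.5),
                women     = runif(100, 0, 100))
d$prestige <- 10 + 3 * d$education + 0.001 * d$income + rnorm(100, sd = 6)
model <- lm(prestige ~ education + income + women, data = d)

plot(model, which = 5)              # residuals vs leverage, with Cook's distance contours
lev <- hatvalues(model)             # leverage of each observation
cd  <- cooks.distance(model)        # Cook's distance: overall influence
head(sort(cd, decreasing = TRUE))   # the most influential candidate outliers
```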
00:12:40.240 --> 00:12:45.760
As before in the Deephouse paper,
to deal with these outliers,
00:12:45.760 --> 00:12:53.230
we will be looking at why the prestigiousness of
one occupation would be different from the others.
00:12:53.230 --> 00:12:55.210
So for example, general managers,
00:12:55.780 --> 00:13:00.070
they earn a lot of money,
so their salaries are high.
00:13:00.070 --> 00:13:04.330
And therefore their predicted
prestigiousness should be high
00:13:04.330 --> 00:13:06.640
as well, because it depends on the income.
00:13:06.640 --> 00:13:10.330
But they are rated less prestigious than the model predicts,
00:13:10.330 --> 00:13:16.180
which means that the model over-predicts their
prestigiousness because of the high income.
00:13:16.180 --> 00:13:22.370
So that could be one reason why
we could drop general managers,
00:13:22.370 --> 00:13:27.440
but you have to use your own judgment,
because this is 102 observations.
00:13:27.440 --> 00:13:32.930
So dropping one observation decreases
our sample size by 1%, approximately.
00:13:32.930 --> 00:13:36.260
So that could be consequential.
00:13:36.260 --> 00:13:40.970
So leverage is the distance
from the center of mass of the data,
00:13:40.970 --> 00:13:41.780
conceptually,
00:13:41.780 --> 00:13:45.410
and Cook's distance is
another measure of influence.
00:13:45.410 --> 00:13:51.860
So we identify outliers using this plot,
then we start looking at the final plot,
00:13:51.860 --> 00:13:54.890
which is the added-variable plot.
00:13:54.890 --> 00:13:59.840
So the added-variable plot shows the
relationship between the dependent variable
00:13:59.840 --> 00:14:02.690
and one independent variable at a time.
00:14:02.690 --> 00:14:08.930
And this plot is interesting:
it plots
00:14:08.930 --> 00:14:16.730
education, the focal independent variable,
regressed on the other independent variables,
00:14:16.730 --> 00:14:19.280
and
it takes the residual.
00:14:19.280 --> 00:14:26.570
So this is the part of education that is
not explained by income or share of women.
00:14:26.570 --> 00:14:29.540
So, if you think about the Venn diagram
00:14:29.540 --> 00:14:34.310
presentation of regression analysis,
this is the part of education
00:14:34.310 --> 00:14:38.120
that does not overlap with any of
the other independent variables.
00:14:38.120 --> 00:14:44.240
Then we have prestige, the regression of
prestige on other independent variables
00:14:44.240 --> 00:14:48.830
and we take the residual.
So we take what is unique to prestige
00:14:48.830 --> 00:14:54.950
and unique to education after partialling out the
influence of all other variables in the model.
00:14:54.950 --> 00:14:57.980
And then we draw a line through those data.
00:14:57.980 --> 00:15:05.210
And this is actually the regression
line of prestige on education.
00:15:05.210 --> 00:15:10.220
So one way to calculate the regression line
is to regress both variables, independent
00:15:10.220 --> 00:15:12.710
and dependent, on all other independent variables.
00:15:12.710 --> 00:15:16.880
And then run the regression analysis
using just one independent variable.
00:15:16.880 --> 00:15:18.920
It produces the exact same result
00:15:18.920 --> 00:15:22.130
as including
education
00:15:22.130 --> 00:15:26.600
with all of the other variables
directly in a multiple regression analysis.
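This two-stage equivalence (the Frisch-Waugh-Lovell result) can be verified directly in R; a self-contained sketch on the simulated stand-in data:

```r
# Refit the model on simulated stand-in data
set.seed(1)
d <- data.frame(education = rnorm(100, 11, 3),
                income    = rlnorm(100, 8.5, 0.5),
                women     = runif(100, 0, 100))
d$prestige <- 10 + 3 * d$education + 0.001 * d$income + rnorm(100, sd = 6)
model <- lm(prestige ~ education + income + women, data = d)

# Added-variable plot for education, built by hand:
e_x <- resid(lm(education ~ income + women, data = d))  # unique part of education
e_y <- resid(lm(prestige  ~ income + women, data = d))  # unique part of prestige
plot(e_x, e_y)
abline(lm(e_y ~ e_x))

# The slope equals the education coefficient from the multiple regression
stopifnot(all.equal(unname(coef(lm(e_y ~ e_x))[2]),
                    unname(coef(model)["education"])))
# (car::avPlots(model) draws these plots for all predictors at once)
```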
00:15:26.600 --> 00:15:32.870
This plot allows us to look for nonlinearities
and heteroskedasticity in a more refined manner.
00:15:32.870 --> 00:15:41.510
So what we can identify from here is that
the effects of income look pretty weird.
00:15:41.510 --> 00:15:46.400
We want to have observations that form
a band around the regression line.
00:15:46.400 --> 00:15:50.900
And here you can see that it
looks more like a bit of a curve,
00:15:50.900 --> 00:15:54.080
it goes up here and then flattens out a bit.
00:15:54.080 --> 00:15:59.150
And we also have much more
dispersion here than dispersion here.
00:15:59.150 --> 00:16:01.640
Now, we have done the diagnostics.
00:16:01.640 --> 00:16:07.220
So we did the normal Q-Q plot,
then we did the residual versus fitted plot,
00:16:07.220 --> 00:16:12.710
we did the influence plot, or the
outlier plot, and the added-variable plot.
00:16:12.710 --> 00:16:15.950
And now we have to decide what
do we want to do with the model.
00:16:15.950 --> 00:16:22.250
One idea that we could try is to use
heteroskedasticity-robust standard errors,
00:16:22.250 --> 00:16:23.990
but our sample size is so small,
00:16:23.990 --> 00:16:28.670
and there is no clear evidence of a
serious heteroskedasticity problem.
00:16:28.670 --> 00:16:34.670
So in this case, I would probably use
the conventional standard errors. We could
00:16:34.670 --> 00:16:39.260
consider dropping general managers
and seeing if the results change.
00:16:39.260 --> 00:16:42.710
Even if we decide to keep general managers in our
00:16:42.710 --> 00:16:45.860
sample, that could work as a
robustness check in the paper.
00:16:45.860 --> 00:16:49.490
So in Deephouse's paper,
they estimated the same model
00:16:49.490 --> 00:16:53.630
with the one outlier observation and without
the outlier, and then compared the results.
00:16:53.630 --> 00:16:57.230
And we should consider a log
transformation of income;
00:16:57.230 --> 00:17:02.270
considering income in relative
terms makes a lot more sense anyway.
00:17:02.270 --> 00:17:07.430
Because when you think of pay raises, for example,
or you want to switch to a new job,
00:17:07.430 --> 00:17:13.160
then you typically want to negotiate a
salary increase relative to your current level.
00:17:13.160 --> 00:17:15.170
Also additional salary,
00:17:15.170 --> 00:17:22.415
how much it increases your quality of
life depends on the current salary level.
00:17:22.415 --> 00:17:28.940
So if you give 1000 euros to somebody who makes
1000 euros per month, that's a big difference.
00:17:28.940 --> 00:17:33.710
If you give 1000 euros to somebody
who makes 5000 euros a month,
00:17:33.710 --> 00:17:35.300
it's a smaller difference.
00:17:35.300 --> 00:17:38.810
So income, company revenues,
00:17:38.810 --> 00:17:42.590
those kinds of quantities we typically
want to consider in relative terms.
00:17:42.590 --> 00:17:45.200
And to do that, we use the log transformation.
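The candidate fixes discussed in this section can be sketched in R as follows; this is a self-contained example on the simulated stand-in data, and the row dropped as the "outlier" is hypothetical:

```r
# Refit the baseline model on simulated stand-in data
set.seed(1)
d <- data.frame(education = rnorm(100, 11, 3),
                income    = rlnorm(100, 8.5, 0.5),
                women     = runif(100, 0, 100))
d$prestige <- 10 + 3 * d$education + 0.001 * d$income + rnorm(100, sd = 6)
model <- lm(prestige ~ education + income + women, data = d)

# 1) Log-transform income so its effect is interpreted in relative terms
model_log <- update(model, . ~ . - income + log(income))

# 2) Drop a suspected outlier (row 1 here is a hypothetical stand-in for
#    general managers) and compare the results as a robustness check
model_drop <- update(model, data = d[-1, ])

# 3) Heteroskedasticity-robust standard errors would come from add-on
#    packages, e.g. lmtest::coeftest(model, vcov = sandwich::vcovHC)
summary(model_log)
```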