In this video, I will show you one possible workflow for regression analysis. This workflow addresses all the assumptions that are empirically testable after a regression analysis. There are, of course, multiple ways of testing assumptions, but this is the way I like to do it. I'm using R for this example, but all of these tests and diagnostics can be done with Stata as well, and most of them can be done with SPSS.

A regression analysis workflow, like any other statistical analysis workflow, starts by stating a hypothesis that we want to test. Then we collect data for testing the hypothesis. After that, we explore the data, because it is important to understand the relationships before modeling. Then we estimate the first regression model, with the independent variables and the dependent variable. We check the results briefly to see what they look like, and we proceed with diagnostics.

The diagnostics consist of various plots, and I prefer plots over statistical tests. The reason is that while you can, for example, run a test for heteroskedasticity, that test will only tell you whether there is a problem or not; it will not tell you the nature of the problem. It is much more informative to look at the actual distribution of the residuals to see what the heteroskedasticity problem is like. Also, if you just eyeball these plots, you will identify basically the same things that the tests would tell you. So I don't generally use tests unless someone asks me to.

When I have done the diagnostics, I figure out what the biggest problem is. For example, I may identify nonlinear relationships that I didn't think of in advance, or some outliers, or some heteroskedasticity. Once I have fixed the biggest problem, I go back and fit another regression model, and then I do the diagnostics again. Once I'm happy, I conclude that that is my final model. After the diagnostics, I possibly run nested model tests against alternative models. And then comes the fun part: I interpret what the regression coefficients mean. I don't just state that some coefficient is 0.02; I explain what it means in my particular research context. That is the hard part in regression analysis.
To demonstrate regression diagnostics, we are going to use the Prestige dataset again. Our dependent variable is prestige this time, and we are going to use education, income, and share of women as independent variables. That is our regression model. We have gone through the regression estimates in a previous video, so I will not explain them in detail. Instead, I am going to focus now on checking the assumptions. How do we know that the six regression assumptions actually hold?

The first assumption is that all relationships are linear; it is a linear model. The second is that observations are independent. Independence of observations comes from our research design, and in a cross-sectional study it is difficult to test. If you have a longitudinal study, then you can do some checks for independence of observations. The third assumption is no perfect collinearity and non-zero variances of the independent variables. Perfect collinearity happens if two or more variables perfectly determine one another. For example, if you have a categorical variable with three categories, then including three dummies leads to this problem, because once you know two of the dummies, you know the value of the third. Zero variance occurs, for example, if you are studying the effects of gender and you have no women in the sample: then you have no variance in gender. We know that this is not a problem in our data, because if it were, we could not even estimate the regression model. The fact that we got regression estimates indicates that we do not have a problem with the third assumption.

The other assumptions are a bit more problematic, because they concern the error term, and we cannot observe the error term. The fourth assumption is that the error term has an expected value of zero given any values of the independent variables. The fifth is that the error term has equal variance; this is the homoskedasticity assumption. The sixth is that the error term is normally distributed. The way we test these three assumptions about the error term is that we use the residuals as estimates of the error term.
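As a sketch of this step in R, assuming the Prestige data that ships with the car package (variables prestige, education, income, and women), the model can be fitted like this:

library(car)  # loads carData, which provides the Prestige dataset

# Regress prestige on education, income, and share of women
m <- lm(prestige ~ education + income + women, data = Prestige)
summary(m)  # brief look at the estimates before the diagnostics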
If an observation is far from the regression line in the population, that is, it has a large value of the error term, then we can expect that it also has a large residual. So we can use the residuals as estimates of the error terms, and most regression diagnostics consist of analyzing the residuals. That is quite natural: the residual is the part of the data that the model does not explain, and the idea of diagnostics is to check whether the model explains the data adequately, so it makes sense to look at the part of the data the model does not explain for clues about what could go wrong.

I normally start with the normal Q-Q plot of the residuals. The normal Q-Q plot shows whether the residuals are normally distributed. It compares the residuals, or quantities calculated based on standardized residuals, against the normal distribution. There are different kinds of residuals, and for an applied researcher it does not really matter to know them all; what is important is that your software will calculate the right kind of residual for you automatically when you do these plots. If the points follow a straight line, the residuals are normally distributed. One problematic case is a chi-square distributed error term, where the residuals in one tail are further from the mean than they are supposed to be. Another is a uniformly distributed error term, which creates an S shape in the normal Q-Q plot.

While normality of the error term is not an important assumption in regression analysis, I nevertheless do this plot because it is quick, it identifies outliers for me, and it gives me a first look at the data. With our actual data, I can see that the residuals follow the normal distribution, so I am happy with this: it is an indication of a well-fitting model with respect to the sixth assumption.
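A minimal way to produce this plot in R, assuming the model object m fitted above (which = 2 selects the normal Q-Q plot among plot.lm's standard diagnostics):

# Normal Q-Q plot of the standardized residuals
plot(m, which = 2)

# Essentially the same plot built by hand
qqnorm(rstandard(m))
qqline(rstandard(m))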
R labels possible outliers in this plot. Newsboys has a large negative residual: newsboys are less prestigious than the model predicts. Farmers are more prestigious than the model predicts: farmers do not make much money, and you do not need a high education to be a farmer, but farmers are still appreciated a lot. That is the other extreme case. So the normal Q-Q plot shows that the residuals are roughly normally distributed, and that is a good thing. We conclude there are no problems, and we start looking at more complicated plots.

The next plot is the residual versus fitted plot. The idea of the residual versus fitted plot is that it allows us to check for nonlinearities and heteroskedasticity in the data. The fitted value is calculated from the regression equation: we multiply the variables by the regression coefficients, and then we plot the residuals against the fitted values. Ideally, there is no pattern; the residuals are just spread out evenly over the fitted values. That is an indication of a well-fitting model in this regard.

One problematic case is a heteroskedasticity problem where the variation of the residuals, and hence of the error term, is a lot smaller in the middle and then opens up to the left and to the right. This is a butterfly shape of residuals, and it is the worst kind of heteroskedasticity problem you could have. It is not very realistic, though, because it is difficult to think of a process that would generate this kind of data. Another case combines nonlinearity with some heteroskedasticity: the residuals form a megaphone opening to the right, with a slight curve. There can also be severe nonlinearity, where the right shape is not a line but a curve, and there are datasets that have both a nonlinearity problem and a heteroskedasticity problem at once. Typically, in these diagnostic plots that plot the residual against something else, you are looking for no pattern at all.

In our residual versus fitted plot, the observations with residuals that are high in absolute value are marked again. Looking at the fitted values, there are only a few professions for which the model predicts high prestige, and most fitted values are between 30 and 70.
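The residual versus fitted plot for the same model object m can be drawn with plot.lm as well (which = 1):

# Residuals against fitted values; look for patterns and uneven spread
plot(m, which = 1)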
What can we infer from this plot? We can infer that maybe the variance of the residuals decreases slightly to the right. We do not have many observations there, so we do not know whether the dispersion is actually the same on the right and we just happen to observe two values from it. But if you compare the dispersion on the left with the dispersion on the right, it does look slightly smaller on the right, so it is possible that we have a heteroskedasticity problem, in which case the fifth assumption does not hold. Whether that is severe enough to warrant using heteroskedasticity-robust standard errors is a bit unclear, because this is not a clear-cut case where we should use them.

Then we check for outliers. So far we have been looking for evidence of heteroskedasticity and nonlinearity; we have found some evidence of heteroskedasticity, but not really of nonlinearity. Looking for outliers is the final step, and it uses the fourth plot, the residual versus leverage plot, which tells us which observations are influential. We are looking for observations that have both high leverage and a residual that is large in absolute value. Here, general managers have high leverage and a large residual in absolute value. Stata, for example, uses the squared residual in this plot; because a square is always positive, it is easier to see which observations have large residuals. In R's plot we have to look for both large negative and large positive values, so it is not as simple as if the plot showed the squared residual. Minister has high leverage, newsboys has a large residual, and general managers has some of both. Cook's distance is another measure of influence, and observations with a large Cook's distance are potential outliers.
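Both influence diagnostics are part of plot.lm's standard output; a sketch, again assuming the model object m from above:

# Standardized residuals against leverage, with Cook's distance contours
plot(m, which = 5)

# Cook's distance for each observation
plot(m, which = 4)

# Or list the most influential observations directly
head(sort(cooks.distance(m), decreasing = TRUE))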
As in the Deephouse paper discussed before, to deal with these outliers we look at why the prestige of one occupation would be different from the others. For example, general managers earn a lot of money; their salaries are high. Therefore their predicted prestige should be high as well, because prestige depends on income. But they are less prestigious than the model predicts, which means that the model over-predicts their prestige because of their high income. That could be one reason to drop general managers, but you have to use your own judgment: this is 102 observations, so dropping one observation reduces our sample size by about 1%, and that could be consequential. To recap, leverage is, conceptually, the distance from the mass center of the data, and Cook's distance is another measure of influence.

Once we have identified outliers using this plot, we start looking at the final plot, which is the added-variable plot. The added-variable plot shows the relationship between the dependent variable and one independent variable at a time. It works as follows. Take education, the focal independent variable, regress it on the other independent variables, and take the residual. That residual is the part of education that is not explained by income or share of women; if you think of the Venn diagram presentation of regression analysis, it is the part of education that does not overlap with any of the other independent variables. Then regress prestige on the other independent variables and take the residual there as well. So we take what is unique to prestige and what is unique to education after partialling out the influence of all the other variables in the model, and then we draw a line through those points. That line is actually the regression line of prestige on education from the multiple regression: one way to calculate a regression coefficient is to regress both the dependent and the focal independent variable on all the other independent variables, and then run a regression using just the two sets of residuals. It produces the exact same coefficient as including education with all the other variables directly in a multiple regression analysis.

This plot allows us to look for nonlinearities and heteroskedasticity in a more refined manner. What we can identify here is that the effect of income looks pretty weird. We want the observations to form a band around the regression line, but for income the pattern looks more like a curve: it goes up and then flattens out a bit. We also have much more dispersion at one end of the range than at the other.
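The car package draws these plots with avPlots; the second half of this sketch builds the education panel by hand, to show the partialling-out logic described above:

# Added-variable plots for all independent variables
avPlots(m)

# The education panel by hand: residualize prestige and education
# on the other independent variables, then plot one residual on the other
e_y <- resid(lm(prestige ~ income + women, data = Prestige))
e_x <- resid(lm(education ~ income + women, data = Prestige))
plot(e_x, e_y)
abline(lm(e_y ~ e_x))  # slope equals education's multiple regression coefficient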
Now we have done the diagnostics: we did the normal Q-Q plot, the residual versus fitted plot, the influence plot (the outlier plot), and the added-variable plot. Now we have to decide what we want to do with the model. One idea we could try is heteroskedasticity-robust standard errors, but our sample size is small and there is no clear evidence of a serious heteroskedasticity problem, so in this case I would probably use the conventional standard errors. We could consider dropping general managers and seeing whether the results change; even if we decide to keep general managers in our sample, that could work as a robustness check in the paper. In Deephouse's paper, they estimated the same model with and without the one outlier observation and then compared the results.

And we should consider a log transformation of income. Considering income in relative terms makes a lot more sense anyway. When you think of pay raises, for example, or when you want to switch to a new job, you typically negotiate a salary increase relative to your current level. Also, how much additional salary increases your quality of life depends on your current salary level. If you give 1000 euros to somebody who makes 1000 euros per month, that is a big difference; if you give 1000 euros to somebody who makes 5000 euros a month, it is a smaller difference. Income, company revenues, and quantities of that kind we typically want to consider in relative terms, and to do that we use the log transformation.
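As a sketch of these follow-ups in R (the row name "general.managers" and the sandwich/lmtest route for robust standard errors are my assumptions, not something fixed by the video):

# Refit with income in relative terms
m2 <- lm(prestige ~ education + log(income) + women, data = Prestige)
summary(m2)

# Robustness check: drop general managers and compare the estimates
m3 <- update(m2, data = Prestige[rownames(Prestige) != "general.managers", ])
summary(m3)

# If we did want heteroskedasticity-robust standard errors:
library(lmtest)
library(sandwich)
coeftest(m2, vcov = vcovHC(m2))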