WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:03.270
We will now take a look at the
interpretation of regression coefficients.
00:00:03.270 --> 00:00:05.940
And the actual interpretation
of what the results mean
00:00:05.940 --> 00:00:09.870
is a more difficult part than
the calculation of the results.
00:00:09.870 --> 00:00:13.500
So, whenever you run a regression analysis,
00:00:14.280 --> 00:00:17.520
the regression coefficients
beta have to be interpreted,
00:00:17.520 --> 00:00:21.570
because the readers of your
research article don't know
00:00:21.570 --> 00:00:22.650
what the betas mean,
00:00:22.650 --> 00:00:23.700
so you have to tell them.
00:00:23.700 --> 00:00:28.620
And there are also other ways that the
regression analysis can be quantified.
00:00:28.620 --> 00:00:31.440
So regression analysis tells us
00:00:31.440 --> 00:00:34.140
what is the direction of an effect and,
00:00:34.140 --> 00:00:37.650
whether an effect is
statistically significant or not.
00:00:37.650 --> 00:00:41.070
What we want to know however is,
00:00:41.070 --> 00:00:43.020
whether the effects are large or not.
00:00:43.020 --> 00:00:44.670
And that depends on the interpretation.
00:00:44.670 --> 00:00:49.710
In some contexts, a regression
coefficient of 10 is very large,
00:00:49.710 --> 00:00:53.520
in other contexts, a regression
coefficient of 10 is very small.
00:00:53.520 --> 00:00:56.670
So you have to consider the context and also,
00:00:56.670 --> 00:00:58.410
what are the variables involved?
00:00:58.410 --> 00:01:04.080
One of the easiest ways to start
interpreting regression analysis is
00:01:04.080 --> 00:01:05.400
to look at the R-squared statistic.
00:01:05.400 --> 00:01:09.690
So the R-squared statistic is calculated
based on the regression results and
00:01:09.690 --> 00:01:14.400
is typically presented here at the
bottom of the regression analysis table.
00:01:14.400 --> 00:01:17.280
Another related statistic
is the adjusted R-squared.
00:01:17.280 --> 00:01:19.710
The R-squared statistic tells us,
00:01:19.710 --> 00:01:21.540
how much the variables,
00:01:21.540 --> 00:01:22.800
the independent variables together
00:01:22.800 --> 00:01:24.900
explain the dependent variable.
00:01:24.900 --> 00:01:28.260
And it's an estimate of the
quality of the model in some sense,
00:01:28.260 --> 00:01:31.620
sometimes it is referred to as
goodness of fit of a regression model,
00:01:31.620 --> 00:01:33.930
or as a coefficient of determination.
00:01:34.650 --> 00:01:37.590
Most people just refer to it as R-squared.
00:01:37.590 --> 00:01:41.490
So the R-squared varies between 0 and 1.
00:01:41.490 --> 00:01:46.230
0 means that the independent variables
don't explain the dependent variable at all,
00:01:46.230 --> 00:01:52.140
1 means that the independent variables
completely explain the dependent variable.
00:01:52.140 --> 00:01:55.500
One problem with R-squared
is that it always goes up
00:01:55.500 --> 00:01:57.240
when you add variables to the model.
00:01:57.240 --> 00:02:01.440
So when your number of
variables starts to increase
00:02:01.440 --> 00:02:03.690
toward the number of observations,
00:02:03.690 --> 00:02:09.330
for example, if you fit a model with
99 variables to 100 observations,
00:02:09.330 --> 00:02:11.580
the R-squared will be exactly 1.
00:02:11.580 --> 00:02:15.970
So it always increases and goes
up and it's positively biased.
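
NOTE The claim above can be checked with a small simulation. This is a minimal sketch and not part of the lecture: the data are pure random noise, the helper name ols_r_squared is hypothetical, and a sample of 10 observations with up to 9 predictors stands in for the 100 observations and 99 variables so that a plain-Python solver stays fast.

```python
import random

def ols_r_squared(rows, y):
    """R-squared from an OLS fit of y on the columns in rows, plus an intercept."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

random.seed(1)
n = 10
y = [random.gauss(0, 1) for _ in range(n)]                              # outcome: noise
noise = [[random.gauss(0, 1) for _ in range(n - 1)] for _ in range(n)]  # unrelated predictors
# Fit nested models using the first k noise predictors.
r2_by_k = {k: ols_r_squared([row[:k] for row in noise], y) for k in (1, 3, 6, 9)}
print(r2_by_k)
```

Even though none of the predictors is related to the outcome, R-squared does not decrease as predictors are added, and with 9 predictors plus an intercept fitted to 10 observations it reaches 1 up to rounding error.
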
00:02:15.970 --> 00:02:18.760
The bias here means that,
00:02:18.760 --> 00:02:22.060
if we calculate the regression
analysis using sample data,
00:02:22.060 --> 00:02:25.450
the results can be expected to be larger than
00:02:25.450 --> 00:02:29.350
if we run the same regression
analysis on the full population.
00:02:29.350 --> 00:02:32.680
Because the R-squared is positively biased,
00:02:32.680 --> 00:02:36.940
we have introduced the
adjusted R-squared statistic,
00:02:36.940 --> 00:02:39.820
which penalizes complex models.
00:02:39.820 --> 00:02:42.310
So when your R-squared goes up,
00:02:42.310 --> 00:02:44.740
just because you have too
many variables in the model,
00:02:44.740 --> 00:02:48.460
then adjusted R-squared adjusts the R-squared down
00:02:48.460 --> 00:02:51.070
to compensate for that bias.
00:02:51.070 --> 00:02:54.070
So it calculates an adjusted value and
00:02:54.070 --> 00:02:56.620
the adjustment is based on
the number of independent variables
00:02:56.620 --> 00:02:58.090
and the sample size.
00:02:58.090 --> 00:03:00.550
When the sample size is large and
00:03:00.550 --> 00:03:03.640
you have a very small number of variables,
00:03:03.640 --> 00:03:08.920
for example, if you have five independent
variables and 500 observations,
00:03:08.920 --> 00:03:11.590
you have 100 observations for
each independent variable,
00:03:11.590 --> 00:03:13.360
the adjustment is very small.
00:03:13.360 --> 00:03:21.070
If you have, let's say, 25 independent variables
and 100 observations in your sample,
00:03:21.070 --> 00:03:22.990
then the adjustment is pretty large,
00:03:22.990 --> 00:03:27.160
because you have only four observations
for each independent variable.
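
NOTE The size of the adjustment in the two cases above can be worked out directly. This is a sketch under assumptions: it uses the commonly used adjusted R-squared formula, 1 - (1 - R2)(n - 1)/(n - k - 1), which the lecture does not spell out, and the unadjusted R-squared of 0.30 is a made-up illustration value.

```python
def adjusted_r_squared(r2, n_obs, n_vars):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

r2 = 0.30  # hypothetical unadjusted R-squared, chosen only for illustration

# 5 independent variables, 500 observations: 100 observations per variable.
large_sample = adjusted_r_squared(r2, n_obs=500, n_vars=5)
# 25 independent variables, 100 observations: 4 observations per variable.
small_sample = adjusted_r_squared(r2, n_obs=100, n_vars=25)

print(round(large_sample, 3), round(small_sample, 3))
```

With these illustrative inputs the first case barely moves (about 0.293), while the second drops to about 0.064, which matches the point that the adjustment grows as observations per variable shrink.
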
00:03:27.160 --> 00:03:30.070
One problem is that
00:03:30.070 --> 00:03:32.140
the adjusted R-squared is not unbiased either,
00:03:32.140 --> 00:03:38.170
but it can be expected to be less
biased than the actual R-squared.
00:03:38.170 --> 00:03:42.670
To actually get an unbiased estimate of the
population R-squared is quite difficult,
00:03:42.670 --> 00:03:44.200
so we don't normally do that.
00:03:44.200 --> 00:03:50.380
The R-squared tells us whether the
model explains the data at all,
00:03:50.380 --> 00:03:53.740
so when R-squared is 0 then
it's the end of interpretation,
00:03:53.740 --> 00:03:58.090
the variables, the independent variables
don't explain the dependent variable at all.
00:03:58.090 --> 00:04:00.040
Then the question is,
00:04:00.040 --> 00:04:02.620
how much is a meaningful explanation?
00:04:02.620 --> 00:04:05.500
If you explain 1 % of a phenomenon,
00:04:05.500 --> 00:04:07.480
in some context that is meaningful,
00:04:07.480 --> 00:04:09.400
in other contexts, it's not meaningful.
00:04:09.400 --> 00:04:13.120
The behavior of people and the
performance of organizations,
00:04:13.120 --> 00:04:15.430
it's very difficult to predict or explain,
00:04:15.430 --> 00:04:18.340
because it depends on so many different things.
00:04:18.340 --> 00:04:20.710
And therefore in social sciences
00:04:20.710 --> 00:04:26.230
the R-squared typically varies
in the 10-20-30 % ballpark.
00:04:26.230 --> 00:04:27.730
So if you have a 30 percent R-squared,
00:04:27.730 --> 00:04:29.260
then you have a pretty good explanation,
00:04:29.260 --> 00:04:31.780
or you could also have a flawed study,
00:04:31.780 --> 00:04:33.400
but we'll talk about that a bit later.
00:04:33.400 --> 00:04:37.090
So you have to consider the context.
00:04:37.090 --> 00:04:44.080
In natural sciences, an R-squared of 99
percent could be considered not large enough.
00:04:44.080 --> 00:04:48.550
So R-squared is useful for the first check of
00:04:48.550 --> 00:04:52.150
whether the interpretation of
the results further makes sense.
00:04:52.150 --> 00:04:53.320
If R-squared is too small,
00:04:53.320 --> 00:04:56.830
then we know that none of these
variables in the model actually matter
00:04:56.830 --> 00:04:58.540
for the dependent variable.
00:04:58.540 --> 00:05:01.900
So interpreting the effects of each
independent variable separately
00:05:01.900 --> 00:05:03.370
is a waste of time.
00:05:05.530 --> 00:05:11.140
Also, the R-squared offers us
an intuitive way of explaining
00:05:11.140 --> 00:05:12.850
whether the results are large or not.
00:05:12.850 --> 00:05:17.530
If I tell you that choosing,
00:05:17.530 --> 00:05:20.830
the choice between three
investment strategies for example
00:05:20.830 --> 00:05:26.230
explains 30% of the variation
of your investment profits,
00:05:26.230 --> 00:05:29.290
then that's a big deal.
00:05:29.290 --> 00:05:32.560
We understand that 30% is
a big deal in that context.
00:05:32.560 --> 00:05:35.290
So because R-squared can be
understood in percentages,
00:05:35.290 --> 00:05:39.070
it has a natural interpretation for most people.
00:05:39.070 --> 00:05:43.600
We'll take a look at how Hekman
uses the R-squared in his paper.
00:05:43.600 --> 00:05:45.910
So Hekman doesn't really interpret,
00:05:45.910 --> 00:05:48.970
what the actual regression
coefficients in their study mean.
00:05:48.970 --> 00:05:52.240
But they are basing their
interpretation of the magnitude
00:05:52.240 --> 00:05:53.980
of the effects on the R-squared.
00:05:53.980 --> 00:05:58.330
And they're saying that between
their control variables only model,
00:05:58.330 --> 00:06:00.130
and the variables,
00:06:00.130 --> 00:06:06.940
the model where there were
the gender and race variables,
00:06:06.940 --> 00:06:10.060
the R-squared increases by between 15 and 24 percent.
00:06:10.060 --> 00:06:12.670
That can be interpreted to mean that,
00:06:12.670 --> 00:06:19.420
the effects of race and gender
are in the ballpark of 15 to 24 %,
00:06:19.420 --> 00:06:22.390
assuming that there's no bias in R-squared,
00:06:22.390 --> 00:06:23.110
which is not true.
00:06:23.110 --> 00:06:26.440
So they should really be looking at
the adjusted R-squared in this case.
00:06:26.440 --> 00:06:28.720
But everyone understands that
00:06:28.720 --> 00:06:33.130
if we say that the customer
satisfaction score's variation,
00:06:33.130 --> 00:06:37.630
one-fourth of that is
explained by gender and race.
00:06:37.630 --> 00:06:40.330
Everyone understands that that's a big deal,
00:06:40.330 --> 00:06:43.540
everyone who understands percentages.
00:06:43.540 --> 00:06:46.060
It provides us with an easy way of saying
00:06:46.060 --> 00:06:50.980
whether the results are of any practical meaning.
00:06:50.980 --> 00:06:54.640
When you have looked at the R-squared,
00:06:54.640 --> 00:06:56.500
the next thing that we want to know is,
00:06:56.500 --> 00:06:59.590
which of the individual variables matters.
00:06:59.590 --> 00:07:04.630
And that's where we get to the interpretation
of the regression coefficients.
00:07:04.630 --> 00:07:08.200
Let's take a look at the Talouselämää 500 example.
00:07:08.200 --> 00:07:10.540
So we have a sample
00:07:10.540 --> 00:07:15.280
where the women-led companies
are 4.7 percentage points
00:07:15.280 --> 00:07:17.470
more profitable than men-led companies.
00:07:17.470 --> 00:07:19.300
And that's a big difference in ROA.
00:07:19.300 --> 00:07:22.750
We want to know whether the
difference is caused by the CEO being a woman
00:07:22.750 --> 00:07:27.010
or whether it's caused by some third factor.
00:07:27.010 --> 00:07:29.770
So we have to present
alternative competing hypotheses.
00:07:29.770 --> 00:07:32.410
One competing hypothesis is
00:07:32.410 --> 00:07:35.020
that it is not an effect of CEO gender,
00:07:35.020 --> 00:07:37.030
instead, it's an effect,
00:07:37.030 --> 00:07:40.660
it's a spurious correlation
caused by firm revenue.
00:07:40.660 --> 00:07:44.500
So that smaller companies are
more likely to hire women,
00:07:44.500 --> 00:07:46.840
and smaller companies are also more profitable.
00:07:46.840 --> 00:07:50.470
Another competing hypothesis,
00:07:50.470 --> 00:07:52.060
or second competing hypothesis, is
00:07:52.060 --> 00:07:53.500
that this is an industry difference.
00:07:53.500 --> 00:07:58.600
For example, manufacturing companies
are less profitable in ROA metric,
00:07:58.600 --> 00:08:00.910
because ROA depends on assets
00:08:00.910 --> 00:08:03.580
and these companies tend to have
more assets than service companies,
00:08:03.580 --> 00:08:09.790
and manufacturing companies are more
likely to hire male CEOs than women CEOs.
00:08:09.790 --> 00:08:11.380
So we have the other variable here.
00:08:11.380 --> 00:08:13.690
Now regression analysis tells us,
00:08:13.690 --> 00:08:17.980
what is the effect of CEO gender ceteris paribus,
00:08:17.980 --> 00:08:22.930
which is an economics term for
holding other variables constant.
00:08:22.930 --> 00:08:27.880
So when the CEO gender changes from
00:08:27.880 --> 00:08:30.700
zero indicating man to one indicating a woman,
00:08:30.700 --> 00:08:33.790
what is the expected increase in return on assets.
00:08:33.790 --> 00:08:38.800
Holding things constant means
00:08:38.800 --> 00:08:43.150
that you are comparing two cases
that are exactly comparable
00:08:43.150 --> 00:08:44.830
on the other variables.
00:08:44.830 --> 00:08:46.810
So if we have two companies
00:08:46.810 --> 00:08:49.570
that are of the same size and same industry,
00:08:49.570 --> 00:08:55.180
then women-led companies are, on
average, beta 1 more profitable.
00:08:55.180 --> 00:08:58.180
So the regression coefficient directly tells us,
00:08:58.180 --> 00:09:00.310
what is the profitability difference.
00:09:00.310 --> 00:09:05.560
If it's 1 percentage point, 2
percentage points or 3 percentage points,
00:09:05.560 --> 00:09:07.780
then it's up to us to interpret,
00:09:07.780 --> 00:09:09.460
whether it's a big effect or not.
00:09:09.460 --> 00:09:12.490
We know that 4.7 percentage
points is a big difference,
00:09:12.490 --> 00:09:16.360
one point is probably not such a big difference.
00:09:16.360 --> 00:09:22.780
Okay so interpreting regression
coefficients is relatively straightforward
00:09:22.780 --> 00:09:25.510
when these variables have a meaningful unit.
00:09:25.510 --> 00:09:29.740
So we know that ROA has a
meaningful unit for managers.
00:09:30.760 --> 00:09:35.680
If we said to a manager
that my company's ROA is 20%,
00:09:35.680 --> 00:09:38.830
they know that it's pretty
good for most industries.
00:09:39.400 --> 00:09:42.760
And we also know that the CEO is female,
00:09:42.760 --> 00:09:44.110
1 it's a woman,
00:09:44.110 --> 00:09:44.980
0 it's a man,
00:09:44.980 --> 00:09:46.390
so it has some meaning for us.
00:09:46.390 --> 00:09:50.950
Sometimes we have units that
don't really have any meaning,
00:09:50.950 --> 00:09:54.670
and that complicates the interpretation.
00:09:54.670 --> 00:09:56.650
So let's take a look at this question.
00:09:56.650 --> 00:10:01.330
Does one unit increase in education,
00:10:01.330 --> 00:10:02.200
does it pay off?
00:10:02.200 --> 00:10:05.920
We have a statement, a regression result,
00:10:05.920 --> 00:10:10.510
that one unit increase in education
leads to one unit increase in salary.
00:10:10.510 --> 00:10:12.310
Is it a big deal?
00:10:12.310 --> 00:10:14.710
We would need to know,
00:10:14.710 --> 00:10:17.230
what is the unit of education,
00:10:17.230 --> 00:10:18.490
what is the unit in salary?
00:10:18.490 --> 00:10:21.790
Let's say that the unit of education is years,
00:10:21.790 --> 00:10:25.690
and salary is euros per year.
00:10:25.690 --> 00:10:31.990
So we say one year increase in education leads to
00:10:31.990 --> 00:10:33.760
one euro increase in annual salary.
00:10:33.760 --> 00:10:34.780
Does it make a difference?
00:10:34.780 --> 00:10:38.410
I would think not, for most people.
00:10:38.410 --> 00:10:40.090
Pretty much
00:10:40.090 --> 00:10:41.620
no one really wants to go to school
00:10:41.620 --> 00:10:44.440
if you just get one additional
euro of income per year.
00:10:44.440 --> 00:10:47.800
So that way, it's not meaningful.
00:10:47.800 --> 00:10:52.720
How about 1-year increase leads to a
1000 euro increase in annual salary?
00:10:52.720 --> 00:10:56.440
That's a more problematic question.
00:10:56.440 --> 00:10:58.750
If we consider Finland,
00:10:58.750 --> 00:11:04.420
where salaries annually are
in tens of thousands of euros,
00:11:04.420 --> 00:11:06.580
maybe in the lower end,
00:11:06.580 --> 00:11:10.420
if you make 20 thousand per year,
00:11:10.420 --> 00:11:14.950
maybe 1000 is worth one year
of education, maybe not,
00:11:14.950 --> 00:11:16.060
depends on, it's 5%,
00:11:16.060 --> 00:11:19.240
depends on how much you like to go to school.
00:11:19.240 --> 00:11:21.370
On the other hand,
00:11:21.370 --> 00:11:24.280
if this data were from a developing country,
00:11:24.280 --> 00:11:30.370
where the annual salaries are in the
thousand, two thousand euro ballpark.
00:11:30.370 --> 00:11:34.420
Then a 1000 euro increase in the
annual salary is a big deal,
00:11:34.420 --> 00:11:39.190
you can double your income
basically in some cases,
00:11:39.190 --> 00:11:42.010
if you go to one additional year of school.
00:11:42.010 --> 00:11:45.430
And that's a big thing for those people.
00:11:45.430 --> 00:11:46.900
So you have to think of,
00:11:46.900 --> 00:11:47.770
what are the units,
00:11:47.770 --> 00:11:50.260
what's the unit of the independent variable,
00:11:50.260 --> 00:11:52.270
what's the unit of the dependent variable and
00:11:52.270 --> 00:11:54.970
what is the context that you're
evaluating the effect in?
00:11:54.970 --> 00:12:01.150
What if we say that one year increase leads
to one Bitcoin increase in annual salary?
00:12:01.150 --> 00:12:03.940
So we get one additional year of education,
00:12:03.940 --> 00:12:06.790
and we get one Bitcoin per year more.
00:12:06.790 --> 00:12:10.690
Well, that's more problematic,
00:12:10.690 --> 00:12:15.910
because people don't have an
intuitive understanding of
00:12:15.910 --> 00:12:17.170
what is the value of Bitcoin?
00:12:17.170 --> 00:12:19.750
So obviously when you say to someone,
00:12:19.750 --> 00:12:21.100
tell somebody that
00:12:21.100 --> 00:12:22.030
I'll give you a Bitcoin.
00:12:22.030 --> 00:12:24.370
Then the first question they'll ask,
00:12:24.370 --> 00:12:26.290
what's the value of Bitcoin in Euros?
00:12:26.290 --> 00:12:30.100
So, in this case, we could convert
the value of Bitcoin to Euro,
00:12:30.100 --> 00:12:31.300
so we can do a conversion
00:12:31.300 --> 00:12:37.180
and express the regression coefficient
in a way that's more understandable.
00:12:37.180 --> 00:12:39.880
Let's say that one year increase leads to
00:12:39.880 --> 00:12:42.460
three thousand euro increase in annual salary.
00:12:42.460 --> 00:12:46.090
I don't know what is the value of Bitcoin now
00:12:46.090 --> 00:12:47.920
but let's assume it's three thousand euros,
00:12:47.920 --> 00:12:51.280
so then we know that it's probably
a big deal for some people.
00:12:51.280 --> 00:12:55.210
So sometimes we can convert the units
to something that we can understand,
00:12:55.210 --> 00:12:58.990
even if the original unit was something
that we don't understand easily.
00:12:58.990 --> 00:13:05.110
What if we have a case of a
unit that cannot be converted?
00:13:05.110 --> 00:13:11.980
So let's say that, result is
that one year increase leads
00:13:11.980 --> 00:13:14.140
to one Buckazoid increase in annual salary.
00:13:14.140 --> 00:13:18.010
Buckazoid is a fictional
currency in a computer game,
00:13:18.010 --> 00:13:22.420
and I don't think that anyone has ever developed
00:13:22.420 --> 00:13:24.460
an exchange rate from Buckazoids to euros.
00:13:24.460 --> 00:13:28.450
So we can't convert this effect
into euros, so what do we do?
00:13:28.450 --> 00:13:35.590
One way of dealing with this Buckazoid issue is
00:13:35.590 --> 00:13:37.960
that we have to first understand,
00:13:37.960 --> 00:13:41.980
what's the average salary in Buckazoids,
00:13:41.980 --> 00:13:43.240
in this fictional universe.
00:13:43.240 --> 00:13:48.430
And also how much are the salaries dispersed.
00:13:48.430 --> 00:13:50.950
If we say that I'll give you ten Buckazoids
00:13:50.950 --> 00:13:52.990
or I'll give you a million Buckazoids,
00:13:52.990 --> 00:13:54.250
it doesn't really make sense
00:13:54.250 --> 00:13:57.850
unless we know, what's the mean income.
00:13:57.850 --> 00:14:02.440
If we know that the mean income in
that fictional world is ten Buckazoids,
00:14:02.440 --> 00:14:05.980
if we tell somebody that you'll
get a million Buckazoids,
00:14:05.980 --> 00:14:07.900
then a million Buckazoids is probably a lot.
00:14:07.900 --> 00:14:12.790
If we tell them that we give
you a million Buckazoids
00:14:12.790 --> 00:14:15.310
and the annual income is a billion Buckazoids,
00:14:15.310 --> 00:14:17.770
then not a big deal, as much.
00:14:17.770 --> 00:14:22.210
To understand how the variable varies,
00:14:22.210 --> 00:14:24.670
we have to look at its mean
and standard deviation.
00:14:24.670 --> 00:14:28.270
And it's useful in this case
when we have these variables
00:14:28.270 --> 00:14:29.530
that don't have any units,
00:14:29.530 --> 00:14:32.890
any naturally interpretable units,
00:14:32.890 --> 00:14:35.080
to look at how it is distributed.
00:14:35.080 --> 00:14:37.720
So we take a look at mean and standard deviation.
00:14:37.720 --> 00:14:40.510
Let's assume that in our sample
00:14:40.510 --> 00:14:42.670
the income in Buckazoids is distributed normally.
00:14:42.670 --> 00:14:48.250
A normal distribution implies
that one standard deviation,
00:14:48.250 --> 00:14:50.440
two standard deviations from the mean,
00:14:50.440 --> 00:14:54.940
have special interpretation.
00:14:54.940 --> 00:14:56.770
So in normal distribution,
00:14:56.770 --> 00:15:03.400
68 % of observations are within plus or minus
one standard deviation of the mean.
00:15:03.400 --> 00:15:09.730
So if we say that our income is one
standard deviation above the mean,
00:15:09.730 --> 00:15:13.750
then we know that we are solidly
in the high-income segments,
00:15:13.750 --> 00:15:17.230
so we are pretty well above the average.
00:15:17.230 --> 00:15:22.450
If we say that our income is two standard
deviations, in Buckazoids, above the mean,
00:15:22.450 --> 00:15:30.040
then we know that we are in the top
2.5 % of the income distribution.
00:15:30.040 --> 00:15:33.670
We can also see that generally the effect of
00:15:34.270 --> 00:15:38.320
one standard deviation increase is pretty big.
00:15:38.320 --> 00:15:41.170
So if you're solidly here below the mean,
00:15:41.170 --> 00:15:43.300
one standard deviation takes you to the average.
00:15:43.300 --> 00:15:46.360
Then two standard deviations, you are pretty rich,
00:15:46.360 --> 00:15:49.420
so you are in the top 2.5 %.
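
NOTE The shares quoted above can be reproduced from the standard normal distribution. This is a minimal sketch using Python's standard library; it assumes, as the lecture does, that the variable is normally distributed. The lecture's 2.5 % is the usual rounded figure (it corresponds to about 1.96 standard deviations); exactly two standard deviations gives a slightly smaller tail.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

within_one_sd = z.cdf(1) - z.cdf(-1)   # share within one SD of the mean
within_two_sd = z.cdf(2) - z.cdf(-2)   # share within two SDs of the mean
above_two_sd = 1 - z.cdf(2)            # share more than two SDs above the mean

print(round(within_one_sd, 3), round(within_two_sd, 3), round(above_two_sd, 3))
```

The three shares come out at roughly 68 %, 95 % and 2.3 %.
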
00:15:49.420 --> 00:15:55.510
So standard deviation units can be useful
for interpreting regression analysis results.
00:15:55.510 --> 00:16:01.060
So if we say that one additional
year of education increases
00:16:01.060 --> 00:16:07.600
your income by one standard
deviation in the Buckazoid units,
00:16:07.600 --> 00:16:09.340
is it a large effect?
00:16:09.340 --> 00:16:11.920
Well, then we would have to,
00:16:11.920 --> 00:16:12.880
for people it is,
00:16:12.880 --> 00:16:14.200
but we would have to think,
00:16:14.200 --> 00:16:16.210
what is the lifespan of these aliens?
00:16:16.210 --> 00:16:19.060
If they only live on average one year,
00:16:19.060 --> 00:16:23.110
then a one-year investment in
education is a huge deal for them.
00:16:23.110 --> 00:16:24.880
So we have to think about the context again.
00:16:24.880 --> 00:16:28.750
Let's take a look at an empirical example.
00:16:28.750 --> 00:16:30.580
So this is Deephouse's paper,
00:16:30.580 --> 00:16:33.760
and table two and model two
from the regression results.
00:16:33.760 --> 00:16:38.170
And we'll be interpreting these
purely through standard deviations.
00:16:38.710 --> 00:16:43.300
The dependent variable ROA has a meaningful unit,
00:16:43.300 --> 00:16:44.860
but we'll just ignore it for now.
00:16:44.860 --> 00:16:47.020
So we'll just be looking at standard deviations.
00:16:47.020 --> 00:16:51.070
Their regression coefficient was -0.02
00:16:51.070 --> 00:16:55.390
for the effect of strategic deviation
on relative return on assets.
00:16:55.390 --> 00:16:57.220
So is it a big effect?
00:16:57.220 --> 00:16:59.050
To understand that,
00:16:59.050 --> 00:17:00.760
we would need to understand
00:17:00.760 --> 00:17:03.130
what is the unit of strategic deviation,
00:17:03.130 --> 00:17:05.590
that's a completely made-up number by them,
00:17:05.590 --> 00:17:06.580
so it doesn't have a meaning,
00:17:06.580 --> 00:17:09.790
ROA has a meaning, but we'll
just ignore it for now.
00:17:09.790 --> 00:17:11.260
We need to know,
00:17:11.260 --> 00:17:13.840
what are the standard
deviations of these variables?
00:17:13.840 --> 00:17:17.200
So the standard deviation of ROA is 0.7
00:17:17.200 --> 00:17:21.880
and the standard deviation of
strategic deviation is 2.9.
00:17:21.880 --> 00:17:23.350
That tells us that,
00:17:23.350 --> 00:17:27.610
if the data are normally distributed,
00:17:27.610 --> 00:17:38.020
then 95% of the observations of
ROA are within plus or minus 1.4 units,
00:17:38.020 --> 00:17:40.360
that's two standard deviations, of the mean.
00:17:40.360 --> 00:17:47.290
The difference between the top 2.5 %
and bottom 2.5 % is then 2.8 units.
00:17:47.290 --> 00:17:51.380
So top 2.5, bottom 2.5, four standard deviations,
00:17:51.380 --> 00:17:54.860
that's 2.8 units.
00:17:54.860 --> 00:17:58.640
So what is the effect of strategic deviation?
00:17:58.640 --> 00:18:04.430
The effect of a one standard
deviation increase of strategic deviation is
00:18:04.430 --> 00:18:12.110
then 2.932 multiplied by -0.020,
00:18:12.110 --> 00:18:16.910
which is about -0.058, a decrease in relative ROA.
00:18:16.910 --> 00:18:20.480
Then we compare this -0.058,
00:18:20.480 --> 00:18:23.780
is it large compared to the 2.8 units?
00:18:23.780 --> 00:18:25.790
So the full scale,
00:18:25.790 --> 00:18:33.440
from the worst 2.5 % to the
best 2.5 %, is 2.8 units,
00:18:33.440 --> 00:18:38.690
and if you increase your strategic
deviation by one standard deviation,
00:18:38.690 --> 00:18:44.540
you get a 0.058 decrease in ROA.
00:18:44.540 --> 00:18:47.000
So it's a smallish effect.
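
NOTE The arithmetic behind this comparison can be written out step by step. This is a sketch using only the numbers quoted in the lecture; the variable names are illustrative, and treating four standard deviations as the distance from the bottom 2.5 % to the top 2.5 % relies on the normality assumption made above.

```python
beta = -0.020          # regression coefficient of strategic deviation
sd_strategic = 2.932   # standard deviation of strategic deviation
sd_roa = 0.7           # standard deviation of relative ROA

# Change in ROA for a one standard deviation increase in strategic deviation.
effect_one_sd = beta * sd_strategic

# Distance from the bottom 2.5 % to the top 2.5 % of ROA: four standard deviations.
roa_range = 4 * sd_roa

# How large the effect is relative to that range.
share_of_range = abs(effect_one_sd) / roa_range

print(round(effect_one_sd, 4), round(roa_range, 1), round(share_of_range, 3))
```

The one standard deviation effect is about -0.059 (the lecture quotes -0.058), which is roughly 2 % of the 2.8-unit range, hence a smallish effect.
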
00:18:47.000 --> 00:18:52.370
We can also understand the
effects of interpretation and
00:18:52.370 --> 00:18:57.080
how it's reported by looking at
this nice example about a sauna.
00:18:57.080 --> 00:19:01.730
So when we ask whether the sauna is warm or not,
00:19:01.730 --> 00:19:03.890
sauna is a Finnish thing.
00:19:03.890 --> 00:19:08.720
A normal research paper would say that
00:19:08.720 --> 00:19:12.110
the temperature of the sauna
is statistically significantly
00:19:12.110 --> 00:19:13.130
different from normal room temperature.
00:19:15.950 --> 00:19:17.450
It tells us that maybe
00:19:17.450 --> 00:19:19.340
the sauna is heating,
00:19:19.340 --> 00:19:21.800
maybe it's ready for going in,
00:19:21.800 --> 00:19:22.820
maybe it's too hot,
00:19:22.820 --> 00:19:25.850
maybe it was on a day before
and it's still cooling.
00:19:25.850 --> 00:19:30.110
It doesn't really tell us anything
about whether the sauna is warm or not.
00:19:30.110 --> 00:19:33.890
And that's equivalent to saying that
00:19:33.890 --> 00:19:35.690
the effect of strategic deviation on ROA
00:19:35.690 --> 00:19:39.440
is negative and statistically
significantly different from zero.
00:19:39.440 --> 00:19:42.050
So the statistical significance just tells us that
00:19:42.770 --> 00:19:43.730
there is some effect,
00:19:43.730 --> 00:19:47.000
it doesn't tell us whether
the effect is large or not.
00:19:47.000 --> 00:19:52.040
Then a slightly better answer is that
00:19:52.040 --> 00:19:54.590
the temperature of the sauna
is currently 80 degrees
00:19:54.590 --> 00:20:02.000
and, comparably, that the effect of
strategic deviation on ROA is -0.020.
00:20:02.000 --> 00:20:05.970
So that is useful for people who understand
00:20:05.970 --> 00:20:10.350
what 80 degrees means and what this -0.020 means.
00:20:10.350 --> 00:20:13.890
So most people who go to the sauna often
00:20:13.890 --> 00:20:15.420
know what 80 degrees centigrade means,
00:20:15.420 --> 00:20:18.990
but you can't assume that the
readers of your research study
00:20:18.990 --> 00:20:20.370
will understand your units,
00:20:20.370 --> 00:20:21.960
so you have to explain what it means.
00:20:21.960 --> 00:20:25.830
So a really good answer to
whether the sauna is hot is
00:20:25.830 --> 00:20:28.050
to say that the temperature is currently 80
00:20:28.050 --> 00:20:31.200
and then tell that most people
who go to the sauna regularly
00:20:31.200 --> 00:20:34.110
would say that the sauna is too
hot but they could still do it.
00:20:34.110 --> 00:20:37.080
So that quantifies that the sauna is pretty hot,
00:20:37.080 --> 00:20:41.190
more so than just saying that it's 80 degrees centigrade.
00:20:41.190 --> 00:20:42.510
The same thing,
00:20:42.510 --> 00:20:46.140
you can say that the effect on ROA is -0.020,
00:20:46.140 --> 00:20:50.520
and the difference between ROAs
of top 2.5 % and bottom 2.5 %,
00:20:50.520 --> 00:20:56.130
for strategic deviation is -0.12,
00:20:56.130 --> 00:21:01.740
so if you go from the least deviant
to the most deviant it is 0.12,
00:21:01.740 --> 00:21:06.540
and the same scale for the ROA is 2.8,
00:21:06.540 --> 00:21:12.990
so we can see that 0.12 is
pretty small compared to 2.8,
00:21:12.990 --> 00:21:14.790
so the effect is quite small.
00:21:14.790 --> 00:21:20.880
There are other things that you can
do to improve your profitability,
00:21:20.880 --> 00:21:23.940
than to change how strategically deviant you are.
00:21:23.940 --> 00:21:29.340
Let's take a look at yet another example,
00:21:29.340 --> 00:21:32.010
so this is from Hekman's paper.
00:21:32.010 --> 00:21:35.820
And Hekman's paper shows a regression table and
00:21:35.820 --> 00:21:42.540
now, the effect of the
number of patients in a panel,
00:21:42.540 --> 00:21:46.620
so how many people go to see a doctor, is -0.04,
00:21:46.620 --> 00:21:53.100
and the age of the doctor is
-0.13, the regression coefficients.
00:21:53.100 --> 00:21:54.780
Are these large effects or not?
00:21:54.780 --> 00:22:01.350
We would have to look at the correlation
table and standard deviations
00:22:01.350 --> 00:22:02.400
and means to understand
00:22:02.400 --> 00:22:04.710
whether these are large effects in a normal case.
00:22:04.710 --> 00:22:08.670
But this is actually not a normal case,
00:22:08.670 --> 00:22:11.850
because these are standardized
regression coefficients.
00:22:11.850 --> 00:22:12.720
They don't report it,
00:22:12.720 --> 00:22:14.790
but you can see it by comparing,
00:22:14.790 --> 00:22:19.200
if you start to interpret this effect
of the number of patients in the panel,
00:22:19.200 --> 00:22:20.250
which is in the thousands,
00:22:20.250 --> 00:22:22.260
and age, which is in the tens.
00:22:22.260 --> 00:22:25.380
You can see that the effect
sizes don't make any sense.
00:22:25.380 --> 00:22:30.930
Also, all these effects
vary between minus 1 and plus 1,
00:22:30.930 --> 00:22:35.010
which is the typical range for a
standardized regression coefficient.
00:22:35.010 --> 00:22:36.480
They can be more or less,
00:22:36.480 --> 00:22:40.830
but they are typically zero point
something or minus zero point something.
00:22:40.830 --> 00:22:44.640
So these are standardized coefficients,
00:22:44.640 --> 00:22:47.940
which means that the data have been standardized.
00:22:47.940 --> 00:22:51.870
So every variable has a standard
deviation of 1 and a mean of 0,
00:22:51.870 --> 00:22:53.250
before regression estimation.
00:22:53.250 --> 00:22:58.710
In that case, we estimate this
directly as standard deviations.
00:22:58.710 --> 00:23:02.640
One unit increase in physician
productivity is associated with
00:23:02.640 --> 00:23:04.260
beta 1 increase in patient satisfaction.
00:23:04.260 --> 00:23:06.780
So we say that these are,
00:23:06.780 --> 00:23:09.777
one standard deviation increase
in physician productivity
00:23:09.777 --> 00:23:15.030
is associated with a beta 1 standard
deviation increase in satisfaction.
00:23:15.030 --> 00:23:19.170
So we interpret directly as standard deviations.
00:23:19.170 --> 00:23:21.480
This looks like the way to do it always,
00:23:21.480 --> 00:23:26.160
so it would simplify life to
always use standardized estimates,
00:23:26.160 --> 00:23:27.900
but that's actually not the case.
00:23:27.900 --> 00:23:33.180
I recommend that you never standardize
a variable that has a meaningful scale.
00:23:33.180 --> 00:23:39.450
So if you have euros or years or something
that makes sense to people as a unit,
00:23:39.450 --> 00:23:40.560
then don't standardize.
00:23:40.560 --> 00:23:42.480
The reason for that is that,
00:23:42.480 --> 00:23:48.180
standardized estimates depend
on how much the variables vary in the sample,
00:23:48.180 --> 00:23:51.300
because the standard deviation
is a sample standard deviation.
00:23:51.300 --> 00:23:53.190
So let's say that here
00:23:53.190 --> 00:23:57.210
the standard deviation of age is 6.58
00:23:57.210 --> 00:23:59.340
and the mean is 50.34,
00:24:00.120 --> 00:24:01.920
so the doctors are quite old.
00:24:01.920 --> 00:24:07.380
What would happen if the doctors in this
sample were actually newly graduated,
00:24:07.380 --> 00:24:09.480
between 24 and 28,
00:24:09.480 --> 00:24:11.550
and the standard deviation would be 1?
00:24:11.550 --> 00:24:15.180
What would happen is
00:24:15.180 --> 00:24:18.960
that the standardized regression
coefficient for the same effect
00:24:18.960 --> 00:24:22.710
would be only -0.02,
00:24:22.710 --> 00:24:29.490
which has a very different
interpretation from -0.13.
00:24:29.490 --> 00:24:31.470
So it's roughly 7 times as small,
00:24:31.470 --> 00:24:35.460
it's the exact same effect,
it's just scaled differently.
00:24:35.460 --> 00:24:39.420
So the differential scaling
means that these effects
00:24:39.420 --> 00:24:43.500
0.02 and 0.13 are not comparable,
00:24:43.500 --> 00:24:48.390
so standardization doesn't
make your results comparable.
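
NOTE The rescaling in the doctor-age example can be sketched as follows. This is an illustration under assumptions rather than the paper's calculation: it uses the textbook relation between a standardized coefficient and the raw coefficient b, namely beta_std = b * sd_x / sd_y, and it assumes that b and the standard deviation of the dependent variable stay the same while only the sample standard deviation of age changes. The function name is hypothetical.

```python
def rescale_standardized(beta_std, sd_x_old, sd_x_new):
    """Standardized coefficient after only the predictor's sample SD changes.

    Relies on beta_std = b * sd_x / sd_y with b and sd_y held fixed (an assumption).
    """
    return beta_std * sd_x_new / sd_x_old

beta_age_older_sample = -0.13   # standardized coefficient for age quoted in the lecture
sd_age_older_sample = 6.58      # sample SD of age quoted in the lecture
sd_age_young_sample = 1.0       # hypothetical SD for newly graduated doctors

beta_age_young_sample = rescale_standardized(
    beta_age_older_sample, sd_age_older_sample, sd_age_young_sample
)
print(round(beta_age_young_sample, 2))
```

The underlying effect of one year of age is unchanged, yet the standardized coefficient shrinks from -0.13 to about -0.02 simply because age varies less in the second sample.
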
00:24:48.390 --> 00:24:52.950
So if you can interpret the
results without standardization,
00:24:52.950 --> 00:24:54.630
it is always better to do so.
00:24:54.630 --> 00:24:59.100
So a rule of thumb, use standardization only
00:24:59.100 --> 00:25:04.650
if your variables, none of
them have a natural scale,
00:25:04.650 --> 00:25:08.190
otherwise, interpret in standard deviation units
00:25:08.190 --> 00:25:12.420
only for those variables for which
a natural scale does not exist.