WEBVTT
Kind: captions
Language: en
00:00:00.000 --> 00:00:02.640
Regression analysis tells
us the relationship between
00:00:02.640 --> 00:00:06.180
one dependent variable and one
or more independent variables.
00:00:06.973 --> 00:00:09.180
One of the problems with regression analysis,
00:00:09.180 --> 00:00:13.380
or one of the limitations is that it
focuses on linear relationships only.
00:00:13.380 --> 00:00:18.874
However, many relationships in nature
and social life are nonlinear in nature.
00:00:19.281 --> 00:00:24.000
And one very useful technique for dealing
with those kinds of relationships is
00:00:24.000 --> 00:00:26.999
the log transformation or
00:00:26.999 --> 00:00:30.810
logarithm transformation if we
write the log in long form.
00:00:31.517 --> 00:00:32.820
So what does that actually do,
00:00:32.820 --> 00:00:34.333
what does log transformation do?
00:00:34.590 --> 00:00:37.380
Many papers contain statements like this.
00:00:37.851 --> 00:00:41.250
We use the log of the revenue since
00:00:41.250 --> 00:00:44.100
revenue for our firms is highly skewed.
00:00:44.100 --> 00:00:46.170
So that's very common,
00:00:46.170 --> 00:00:48.463
the researchers say that something is skewed
00:00:48.463 --> 00:00:52.260
and we take a log of something
to make it more normal.
00:00:52.988 --> 00:00:55.170
That has a couple of issues,
00:00:55.170 --> 00:00:56.550
that kind of statement.
00:00:56.550 --> 00:00:58.500
But let's first take a look at
00:00:58.500 --> 00:01:01.020
what log transformation does to address skewness.
00:01:01.577 --> 00:01:03.257
So these are the data
00:01:03.257 --> 00:01:07.320
from the largest 500 Finnish companies in 2005,
00:01:07.427 --> 00:01:09.167
the revenues for those companies.
00:01:09.167 --> 00:01:12.240
So we have one very large company here,
00:01:12.561 --> 00:01:14.361
then some companies here
00:01:14.361 --> 00:01:18.990
and most are here around a few
hundred million euros of revenue.
00:01:18.990 --> 00:01:22.080
So we have a couple of billion-euro companies,
00:01:22.080 --> 00:01:25.873
and most companies are in the
hundreds of millions range.
00:01:25.873 --> 00:01:28.933
So this distribution is highly skewed,
00:01:28.933 --> 00:01:33.270
it means that there is this long tail here,
00:01:33.270 --> 00:01:36.660
so we have, most observations are clustered here,
00:01:37.453 --> 00:01:41.010
and then we have some that
go to this long tail here.
00:01:41.010 --> 00:01:45.000
These kinds of skewed distributions
are sometimes problematic,
00:01:45.000 --> 00:01:46.928
but we have to understand
00:01:46.928 --> 00:01:51.441
that, for example, regression
analysis makes no assumptions about
00:01:51.441 --> 00:01:54.360
how observed variables are distributed.
00:01:54.360 --> 00:01:57.000
It makes some assumptions but
00:01:57.000 --> 00:02:00.908
the distribution of observed
variables is not one of those.
00:02:00.908 --> 00:02:02.811
If we take a logarithm of this,
00:02:02.811 --> 00:02:04.333
every revenue here,
00:02:04.333 --> 00:02:06.432
we get the distribution that looks like that.
00:02:06.432 --> 00:02:10.230
So we get something that doesn't
have as long a tail as before,
00:02:10.230 --> 00:02:15.150
so now the observations are more
closely clustered around the mean,
00:02:15.150 --> 00:02:17.310
there is still some tail here,
00:02:17.310 --> 00:02:18.360
but not as severe.
00:02:18.724 --> 00:02:22.530
So these units here are now logarithms.
00:02:22.530 --> 00:02:26.400
I'm using the base 10 here
for ease of exposition but
00:02:26.400 --> 00:02:28.440
normally we use the natural logarithm,
00:02:28.440 --> 00:02:30.960
it doesn't really make a
difference for your analysis.
00:02:31.346 --> 00:02:35.006
So this is the 100 million threshold,
00:02:35.177 --> 00:02:37.127
this is the 1 billion threshold,
00:02:37.127 --> 00:02:40.350
then we have 10 billion and then
100 billion thresholds here.
00:02:40.800 --> 00:02:46.710
So we change the scale of the
variable by taking a logarithm.
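
NOTE
A minimal sketch, not part of the lecture, of what the base-10 log does to the scale and the skewness.
The revenue figures are hypothetical (they are NOT the Finnish company data), and skewness() is an illustrative helper.
```python
import math
import statistics
# Hypothetical revenues in euros (assumption; not the data shown in the lecture).
revenues = [1.2e8, 1.5e8, 2.0e8, 2.5e8, 3.0e8, 4.0e8, 6.0e8, 1.0e9, 3.0e9, 3.0e10]
log_revenues = [math.log10(r) for r in revenues]
def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - mean) ** 3 for v in values]) / sd ** 3
# On the base-10 scale, 100 million maps to 8, 1 billion to 9, and so on.
thresholds = {t: round(math.log10(t), 6) for t in (1e8, 1e9, 1e10, 1e11)}
s_raw = skewness(revenues)
s_log = skewness(log_revenues)
print(thresholds)
print(s_raw, s_log)
```
With these made-up numbers the logged values are still right-skewed, but less so than the raw revenues.
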
00:02:49.023 --> 00:02:52.384
What does the logarithm transformation actually do?
00:02:52.513 --> 00:02:55.063
It changes the shape of the distribution,
00:02:55.063 --> 00:02:56.910
so this is highly skewed,
00:02:57.188 --> 00:02:59.468
this is still skewed but less so.
00:02:59.468 --> 00:03:03.720
So in some cases, it actually
reduces the skewness of data,
00:03:03.720 --> 00:03:06.300
but that's not the reason why we actually use it.
00:03:06.300 --> 00:03:09.690
So we don't need our data to be normal
00:03:09.690 --> 00:03:13.890
but instead sometimes thinking
in terms of relative units
00:03:13.890 --> 00:03:16.140
makes a lot more sense than
00:03:16.140 --> 00:03:18.060
thinking in terms of absolute units.
00:03:18.060 --> 00:03:20.580
So absolute units here means that
00:03:20.580 --> 00:03:29.730
the difference between 0 and 1 billion
is the same as 1 billion and 2 billion.
00:03:30.908 --> 00:03:34.140
So let's think for a while,
00:03:34.140 --> 00:03:37.890
does it make sense to say that
00:03:37.890 --> 00:03:40.500
when a company grows from 0 to 1 billion,
00:03:40.650 --> 00:03:43.920
it is the same kind of
transformation for the company
00:03:43.920 --> 00:03:46.140
as when it grows from 1 billion to 2 billion?
00:03:46.783 --> 00:03:49.273
No, that doesn't make any sense.
00:03:49.273 --> 00:03:51.840
Also, companies normally don't say
00:03:51.840 --> 00:03:54.240
that we grew by this many euros,
00:03:54.240 --> 00:04:03.056
instead, we grew by 10% or 15% compared to the previous year's revenue.
00:04:03.056 --> 00:04:08.850
So quite often we like to
compare things in relative terms.
00:04:08.850 --> 00:04:14.370
You get your salary increases
based on labour union negotiations,
00:04:14.584 --> 00:04:16.984
they are hardly ever fixed euro amounts,
00:04:16.984 --> 00:04:19.860
they are 1 % - 2 %,
00:04:19.860 --> 00:04:23.280
something related to your current salary level.
00:04:23.280 --> 00:04:24.810
So they are relative units.
00:04:24.810 --> 00:04:29.764
So here the relative units mean that
the difference between 1 billion,
00:04:32.206 --> 00:04:38.250
or 100 million and 1 billion is relatively
the same as the difference between
00:04:38.250 --> 00:04:40.448
1 billion and 10 billion.
00:04:40.448 --> 00:04:45.660
So each space between these two ticks doesn't
00:04:45.660 --> 00:04:49.290
refer to unit increase,
00:04:49.290 --> 00:04:52.461
instead, it refers to a tenfold increase.
00:04:52.697 --> 00:04:56.657
So things increase relative to the previous level.
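
NOTE
A small sketch, not from the lecture, contrasting absolute and relative units.
The three revenue levels are chosen only to match the tick marks mentioned above.
```python
import math
# Revenue levels in euros at the 100 million, 1 billion and 10 billion ticks.
levels = [1e8, 1e9, 1e10]
# Absolute units: the step size in euros between neighbouring levels.
absolute_steps = [b - a for a, b in zip(levels, levels[1:])]
# Relative units: the step size on the log10 scale between neighbouring levels.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(levels, levels[1:])]
print(absolute_steps)
print(log_steps)
```
In euros the second step is ten times the first, while on the log scale both steps are one unit, that is, one tenfold increase.
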
00:04:59.634 --> 00:05:01.187
So let's take a look at,
00:05:01.187 --> 00:05:03.587
what it means to run a regression
analysis with log transformation,
00:05:03.587 --> 00:05:05.241
and why would we want to do that?
00:05:06.960 --> 00:05:09.330
Transforming the variables to be less skewed
00:05:09.330 --> 00:05:13.384
is not the right reason to use log transformation
00:05:13.384 --> 00:05:15.870
and if you want to reduce skewness,
00:05:15.870 --> 00:05:18.000
you, of course, can do log transformation,
00:05:18.000 --> 00:05:19.680
but you have to understand that
00:05:19.680 --> 00:05:23.160
there are other more important
reasons to use log transformation
00:05:23.160 --> 00:05:26.516
and it also influences how
you interpret your results.
00:05:27.244 --> 00:05:31.170
So this is example data
from the Prestige data set,
00:05:31.170 --> 00:05:36.210
these are occupations from the
Canadian census of 1970-something.
00:05:36.424 --> 00:05:39.780
And we have prestige score of an occupation
00:05:39.780 --> 00:05:42.150
and then the average income of an occupation.
00:05:42.150 --> 00:05:43.731
We're interested in learning,
00:05:43.731 --> 00:05:46.260
how much income depends on prestige.
00:05:46.967 --> 00:05:49.680
We can see that there is a linear effect here,
00:05:49.680 --> 00:05:53.331
prestige goes from 20 to 80 and
00:05:53.331 --> 00:05:58.890
first income increases and
then it starts to increase
00:05:58.890 --> 00:06:00.381
in a nonlinear fashion.
00:06:00.381 --> 00:06:03.120
So if we were to draw a line or a curve,
00:06:03.120 --> 00:06:07.350
it would first go flat and
then it would curve up a bit.
00:06:08.014 --> 00:06:12.750
So the line here is not the best
description of the data.
00:06:13.136 --> 00:06:17.383
We can see here that these observations are below the regression line,
00:06:17.383 --> 00:06:19.367
and these are above the regression line.
00:06:19.367 --> 00:06:21.561
So instead of fitting a line,
00:06:21.561 --> 00:06:24.287
fitting some kind of curve that bends up
00:06:24.287 --> 00:06:25.663
would be better,
00:06:25.663 --> 00:06:27.510
something like that.
00:06:27.510 --> 00:06:31.277
So instead of saying that these
are characterized by a line,
00:06:31.277 --> 00:06:36.720
we say that these observations are
characterized by this blue curve here.
00:06:37.191 --> 00:06:38.511
And that is,
00:06:38.511 --> 00:06:40.680
what the log transformation does for us
00:06:40.680 --> 00:06:42.990
and it's the important reason why we use it.
00:06:43.354 --> 00:06:45.184
So instead of saying that
00:06:45.184 --> 00:06:50.820
income increases as a
constant function of prestige,
00:06:50.820 --> 00:06:53.220
we say that income increases as
00:06:53.220 --> 00:06:56.340
a relative function to the
current level of income,
00:06:56.340 --> 00:06:57.930
as a function of prestige.
00:06:59.258 --> 00:07:02.580
Let's take a log transformation of income and
00:07:02.580 --> 00:07:04.350
run a regression analysis.
00:07:04.907 --> 00:07:06.570
So here's my regression analysis.
00:07:06.570 --> 00:07:07.646
This is the income,
00:07:07.774 --> 00:07:10.084
done with R, using this data.
00:07:10.084 --> 00:07:12.660
We can see that a one-unit increase in prestige
00:07:12.660 --> 00:07:18.458
leads to 176 Canadian dollars more per year,
00:07:18.458 --> 00:07:21.840
and then when we have a log of income,
00:07:21.840 --> 00:07:26.640
then log of income increases by 0.03,
00:07:26.640 --> 00:07:29.700
for every additional unit of prestige.
00:07:30.942 --> 00:07:34.002
So the problem with this,
00:07:34.002 --> 00:07:37.710
we note that the log model has
a slightly higher R-squared
00:07:37.710 --> 00:07:40.050
and also a slightly higher adjusted R-squared
00:07:40.050 --> 00:07:40.830
than the income model.
00:07:40.830 --> 00:07:42.553
So based on that metric,
00:07:42.553 --> 00:07:47.040
we can make an informal judgment
that this could be a better model.
00:07:47.040 --> 00:07:51.360
It's not certain that a better or a higher
R-squared means that it's a better model,
00:07:51.360 --> 00:07:52.676
but it could be.
00:07:53.019 --> 00:07:56.623
How we judge models will come up in later videos.
00:07:57.973 --> 00:07:59.310
So how do we interpret?
00:07:59.310 --> 00:08:02.940
What does this 0.03 increase
in the log of revenue,
00:08:02.940 --> 00:08:04.020
log of income mean?
00:08:04.855 --> 00:08:08.400
For most people the metric of a log of income
00:08:08.400 --> 00:08:10.431
doesn't have any meaning.
00:08:10.838 --> 00:08:16.170
If someone tells me that the logarithm
of your income will increase by 0.01,
00:08:16.256 --> 00:08:18.386
I know what it means because I've done this,
00:08:18.407 --> 00:08:20.417
I've read my statistics books,
00:08:20.417 --> 00:08:21.776
most people don't.
00:08:22.311 --> 00:08:24.681
So how do we interpret?
00:08:24.788 --> 00:08:30.235
There are two ways of interpreting
the log transformation results.
00:08:30.235 --> 00:08:34.290
One is the general way of
interpreting any nonlinear effects,
00:08:34.590 --> 00:08:35.910
and that is plotting.
00:08:36.081 --> 00:08:37.821
So you can do,
00:08:37.821 --> 00:08:41.246
here are the regression results
for the log transformation model.
00:08:41.246 --> 00:08:43.350
What we do here is that
00:08:43.350 --> 00:08:49.795
we calculate the fitted values of the
logarithm of income based on prestige.
00:08:49.795 --> 00:08:53.091
So this is simply taking the formula,
00:08:53.091 --> 00:09:00.060
adding intercept 7.46 plus 0.02 times 20.
00:09:00.724 --> 00:09:03.784
So that provides us with the fitted income.
00:09:03.784 --> 00:09:08.850
And the hat here denotes
that this is a fitted value
00:09:08.850 --> 00:09:10.663
from the regression analysis.
00:09:10.663 --> 00:09:14.370
Then we take exponentials of these incomes.
00:09:14.734 --> 00:09:16.714
So when you take a logarithm of a number,
00:09:16.843 --> 00:09:19.003
you get another number.
00:09:19.003 --> 00:09:21.720
When you apply exponential to that other number,
00:09:21.720 --> 00:09:24.150
you get back your original number.
00:09:24.150 --> 00:09:28.440
So we say that the exponential is
the inverse function of a logarithm,
00:09:28.440 --> 00:09:31.500
and logarithm is the inverse
function of exponential.
00:09:31.500 --> 00:09:35.610
Because we can apply one to
get back the original number,
00:09:35.610 --> 00:09:39.094
that was used as an input for the other.
00:09:39.523 --> 00:09:44.460
So exponential transformation allows us
to kind of undo the log transformation,
00:09:44.460 --> 00:09:48.583
and we get these predicted incomes
00:09:48.819 --> 00:09:51.373
for each prestige level.
00:09:51.373 --> 00:09:54.428
Then we plot the data,
00:09:54.428 --> 00:09:59.280
so we plot these exponentiated
predicted logs of income here,
00:09:59.280 --> 00:10:01.590
and as a function of prestige,
00:10:01.590 --> 00:10:02.700
we get this curve.
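
NOTE
A minimal sketch of the plotting recipe above: compute the fitted log income, then exponentiate.
The coefficients are assumed, rounded from the values quoted in the lecture; predicted_income is an illustrative name.
```python
import math
# Assumed coefficients of the log model (rounded from the lecture's quoted values).
intercept = 7.46
slope = 0.025
def predicted_income(prestige):
    """Fitted log income for a prestige score, passed through exp() to undo the log."""
    fitted_log_income = intercept + slope * prestige
    return math.exp(fitted_log_income)
# Points on the curve at a few prestige levels; plotting them gives the bending line.
curve = {p: predicted_income(p) for p in range(20, 81, 20)}
for p, income in curve.items():
    print(p, round(income))
```
Each 20-point step in prestige multiplies the predicted income by the same factor, which is why the curve bends upward.
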
00:10:03.578 --> 00:10:05.640
So whenever you don't know
00:10:05.640 --> 00:10:08.850
how to interpret a particular regression estimate
00:10:08.850 --> 00:10:14.400
that has been calculated
based on some transformation,
00:10:14.828 --> 00:10:18.150
one very good way of doing
that is to plot the effect.
00:10:18.150 --> 00:10:21.870
You can also plot the linear model effects only
00:10:21.870 --> 00:10:23.280
and then you can compare
00:10:23.280 --> 00:10:25.260
which one looks more reasonable.
00:10:25.538 --> 00:10:26.708
Here the blue curve,
00:10:26.751 --> 00:10:28.641
the log-transformed results,
00:10:28.641 --> 00:10:30.690
look like a much more reasonable
explanation for the data
00:10:30.690 --> 00:10:32.940
than the red line.
00:10:33.347 --> 00:10:35.447
So that is one way,
00:10:35.447 --> 00:10:36.403
the general way
00:10:36.403 --> 00:10:41.190
that you can interpret any nonlinear effects.
00:10:41.190 --> 00:10:43.620
And this kind of plot, where you draw a line,
00:10:43.620 --> 00:10:46.440
is called a marginal prediction plot.
00:10:46.440 --> 00:10:48.240
We will cover this later in the course.
00:10:49.718 --> 00:10:56.070
Another way of interpreting regression
analysis results after log transformation
00:10:56.070 --> 00:10:57.930
is to interpret them directly.
00:10:58.380 --> 00:11:01.774
So log transformation is a
special case of transformations,
00:11:01.774 --> 00:11:04.684
because it has a natural interpretation.
00:11:04.877 --> 00:11:07.680
These interpretations are given
by Wooldridge's book here.
00:11:07.680 --> 00:11:13.295
So when we take the log of
the dependent variable then
00:11:13.295 --> 00:11:15.490
each of these regression coefficients,
00:11:15.490 --> 00:11:17.388
here only for prestige,
00:11:17.860 --> 00:11:19.287
change their meaning.
00:11:19.287 --> 00:11:24.760
So the meaning of this unit increase of prestige
00:11:24.760 --> 00:11:28.000
is actually translated to a relative increase.
00:11:28.000 --> 00:11:30.970
So beta1 of prestige here,
00:11:30.970 --> 00:11:34.735
doesn't tell us what the effect of a
unit increase of prestige
00:11:34.735 --> 00:11:37.330
is on income.
00:11:37.673 --> 00:11:39.503
Instead, it tells,
00:11:39.503 --> 00:11:43.891
what is the effect of one
unit increase of prestige
00:11:44.577 --> 00:11:47.500
on the relative income.
00:11:47.500 --> 00:11:55.660
So if the regression coefficient
of prestige is 0.025,
00:11:55.660 --> 00:11:57.130
like it is here,
00:11:57.323 --> 00:11:58.883
then it means that
00:11:58.883 --> 00:12:08.410
one unit increase in prestige leads
to a 2.5 % increase in salary,
00:12:08.410 --> 00:12:10.360
compared to the current salary level.
00:12:10.681 --> 00:12:13.441
So it's an exponential growth model,
00:12:13.441 --> 00:12:15.910
that's why we use the exponential function.
00:12:16.381 --> 00:12:21.833
So every time your prestige of
occupation increases by one,
00:12:21.833 --> 00:12:26.380
then your salary goes up 2.5 %
compared to the previous level.
00:12:26.937 --> 00:12:31.720
Calculating how much, for
example, ten units would mean
00:12:31.720 --> 00:12:32.890
could be a bit difficult,
00:12:32.890 --> 00:12:37.210
because we have to take 2.5 %
00:12:37.210 --> 00:12:40.210
and then apply that ten times.
00:12:40.210 --> 00:12:44.410
So it's 1.025 to the power of 10,
00:12:44.410 --> 00:12:48.040
and then you will get the effect
of a 10-unit increase of prestige.
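
NOTE
A worked sketch of the ten-unit calculation, not from the lecture.
It compares compounding 2.5 % ten times with the exact effect implied by a log model, exp(slope * units) - 1.
```python
import math
slope = 0.025  # coefficient of prestige quoted in the lecture
units = 10
# Approximation: read the coefficient as 2.5 % growth and compound it ten times.
compounded = (1 + slope) ** units - 1
# Exact effect implied by the log model for a ten-unit increase.
exact = math.exp(slope * units) - 1
print(f"{compounded:.3f} {exact:.3f}")
```
Both routes give an increase of roughly 28 %, noticeably more than the 25 % that adding 2.5 % ten times would suggest.
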
00:12:48.490 --> 00:12:54.078
In practice, your statistical software will
do the marginal effects calculations for you.
00:12:54.078 --> 00:12:57.970
So doing a plot like that would
simplify the interpretation,
00:12:57.970 --> 00:12:59.530
because you can see directly,
00:12:59.530 --> 00:13:04.900
what is the effect of moving from
prestige of 40 to prestige of 60
00:13:04.900 --> 00:13:06.258
by taking the line.
00:13:06.258 --> 00:13:10.270
Also, the software will give you
the numbers behind these plots.
00:13:10.677 --> 00:13:14.650
So that's how you calculate marginal effects.
00:13:14.650 --> 00:13:16.930
The actual calculation is
covered in a different video.