WEBVTT Kind: captions Language: en 00:00:00.000 --> 00:00:02.640 Regression analysis tells us the relationship between 00:00:02.640 --> 00:00:06.180 one dependent variable and one or more independent variables. 00:00:06.973 --> 00:00:09.180 One of the problems with regression analysis, 00:00:09.180 --> 00:00:13.380 or one of the limitations, is that it focuses on linear relationships only. 00:00:13.380 --> 00:00:18.874 However, many relationships in nature and social life are nonlinear. 00:00:19.281 --> 00:00:24.000 And one very useful technique for dealing with those kinds of relationships is 00:00:24.000 --> 00:00:26.999 the log transformation, or 00:00:26.999 --> 00:00:30.810 logarithm transformation if we write the log in long form. 00:00:31.517 --> 00:00:32.820 So what does that actually do, 00:00:32.820 --> 00:00:34.333 what does log transformation do? 00:00:34.590 --> 00:00:37.380 Many papers contain statements like this: 00:00:37.851 --> 00:00:41.250 We use the log of the revenue since 00:00:41.250 --> 00:00:44.100 revenue for our firms is highly skewed. 00:00:44.100 --> 00:00:46.170 So that's very common, 00:00:46.170 --> 00:00:48.463 the researchers say that something is skewed 00:00:48.463 --> 00:00:52.260 and we take a log of something to make it more normal. 00:00:52.988 --> 00:00:55.170 That has a couple of issues, 00:00:55.170 --> 00:00:56.550 that kind of statement. 00:00:56.550 --> 00:00:58.500 But let's first take a look at 00:00:58.500 --> 00:01:01.020 what log transformation does to address skewness. 00:01:01.577 --> 00:01:03.257 So these are the data 00:01:03.257 --> 00:01:07.320 from the largest 500 Finnish companies in 2005, 00:01:07.427 --> 00:01:09.167 the revenues for those companies. 00:01:09.167 --> 00:01:12.240 So we have one very large company here, 00:01:12.561 --> 00:01:14.361 then some companies here 00:01:14.361 --> 00:01:18.990 and most are here around a few hundred million euros of revenue.
00:01:18.990 --> 00:01:22.080 So we have a couple of billion-euro companies, 00:01:22.080 --> 00:01:25.873 and most companies are in the hundreds of millions range. 00:01:25.873 --> 00:01:28.933 So this distribution is highly skewed, 00:01:28.933 --> 00:01:33.270 it means that there is this long tail here, 00:01:33.270 --> 00:01:36.660 so most observations are clustered here, 00:01:37.453 --> 00:01:41.010 and then we have some that go to this long tail here. 00:01:41.010 --> 00:01:45.000 These kinds of skewed distributions are sometimes problematic, 00:01:45.000 --> 00:01:46.928 but we have to understand 00:01:46.928 --> 00:01:51.441 that, for example, regression analysis makes no assumptions about 00:01:51.441 --> 00:01:54.360 how observed variables are distributed. 00:01:54.360 --> 00:01:57.000 It makes some assumptions but 00:01:57.000 --> 00:02:00.908 the distribution of observed variables is not one of those. 00:02:00.908 --> 00:02:02.811 If we take a logarithm of this, 00:02:02.811 --> 00:02:04.333 every revenue here, 00:02:04.333 --> 00:02:06.432 we get the distribution that looks like that. 00:02:06.432 --> 00:02:10.230 So we get something that doesn't have as long a tail as before, 00:02:10.230 --> 00:02:15.150 so now the observations are more closely clustered around the mean, 00:02:15.150 --> 00:02:17.310 there is still some tail here, 00:02:17.310 --> 00:02:18.360 but not as severe. 00:02:18.724 --> 00:02:22.530 So these units here are now logarithms. 00:02:22.530 --> 00:02:26.400 I'm using the base 10 here for ease of exposition but 00:02:26.400 --> 00:02:28.440 normally we use the natural logarithm, 00:02:28.440 --> 00:02:30.960 it doesn't really make a difference for your analysis. 00:02:31.346 --> 00:02:35.006 So this is the 100 million threshold, 00:02:35.177 --> 00:02:37.127 this is the 1 billion threshold, 00:02:37.127 --> 00:02:40.350 then we have the 10 billion and then the 100 billion thresholds here.
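To make the tick marks just described concrete, here is a minimal sketch. The four revenue figures are hypothetical round numbers chosen to match the thresholds mentioned above, not values from the Finnish company data. It shows that each tenfold step in revenue is one unit on the base-10 log scale, and that switching to the natural logarithm only rescales every value by the same constant.

```python
import math

# Hypothetical revenues in euros, chosen to match the thresholds mentioned above
revenues = [100e6, 1e9, 10e9, 100e9]

# Base-10 logarithm of each revenue
log_revenues = [math.log10(r) for r in revenues]
for r, lr in zip(revenues, log_revenues):
    print(f"{r:>18,.0f} EUR -> log10 = {lr:.1f}")

# Distance between neighbouring thresholds on the log scale
steps = [b - a for a, b in zip(log_revenues, log_revenues[1:])]
print("steps on the log10 scale:", steps)

# The natural log is the base-10 log multiplied by a constant, ln(10)
ratios = [math.log(r) / math.log10(r) for r in revenues]
print("ln / log10 for each revenue:", ratios)
```

Because the two bases differ only by a constant factor, the choice of base changes the size of a regression coefficient but not the fit of the model.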
00:02:40.800 --> 00:02:46.710 So we change the scale of the variable by taking a logarithm. 00:02:49.023 --> 00:02:52.384 What does the logarithm transformation actually do? 00:02:52.513 --> 00:02:55.063 It changes the shape of the distribution, 00:02:55.063 --> 00:02:56.910 so this is highly skewed, 00:02:57.188 --> 00:02:59.468 this is still skewed but less so. 00:02:59.468 --> 00:03:03.720 So in some cases, it actually reduces the skewness of data, 00:03:03.720 --> 00:03:06.300 but that's not the reason why we actually use it. 00:03:06.300 --> 00:03:09.690 So we don't need our data to be normal 00:03:09.690 --> 00:03:13.890 but instead sometimes thinking in terms of relative units 00:03:13.890 --> 00:03:16.140 makes a lot more sense than 00:03:16.140 --> 00:03:18.060 thinking in terms of absolute units. 00:03:18.060 --> 00:03:20.580 So absolute units here means that 00:03:20.580 --> 00:03:29.730 the difference between 0 and 1 billion is the same as the difference between 1 billion and 2 billion. 00:03:30.908 --> 00:03:34.140 So let's think for a while, 00:03:34.140 --> 00:03:37.890 does it make sense to say that 00:03:37.890 --> 00:03:40.500 when a company grows from 0 to 1 billion 00:03:40.650 --> 00:03:43.920 it is the same kind of transformation for the company 00:03:43.920 --> 00:03:46.140 as when it grows from 1 billion to 2 billion? 00:03:46.783 --> 00:03:49.273 No, that doesn't make any sense. 00:03:49.273 --> 00:03:51.840 Also, companies rarely say 00:03:51.840 --> 00:03:54.240 that they grew by this many euros, 00:03:54.240 --> 00:04:03.056 instead, they say they grew by 10% or 15% compared to the previous year's revenue. 00:04:03.056 --> 00:04:08.850 So quite often we like to compare things in relative terms.
00:04:08.850 --> 00:04:14.370 You get your salary increases based on labour union negotiations, 00:04:14.584 --> 00:04:16.984 they are hardly ever fixed euro amounts, 00:04:16.984 --> 00:04:19.860 they are 1 % - 2 %, 00:04:19.860 --> 00:04:23.280 something related to your current salary level. 00:04:23.280 --> 00:04:24.810 So they are relative units. 00:04:24.810 --> 00:04:29.764 So here the relative units mean that the difference between 1 billion, 00:04:32.206 --> 00:04:38.250 or rather between 100 million and 1 billion, is in relative terms the same as the difference between 00:04:38.250 --> 00:04:40.448 1 billion and 10 billion. 00:04:40.448 --> 00:04:45.660 So each space between these two ticks doesn't 00:04:45.660 --> 00:04:49.290 refer to a unit increase, 00:04:49.290 --> 00:04:52.461 instead, it refers to a tenfold increase. 00:04:52.697 --> 00:04:56.657 So things increase relative to the previous level. 00:04:59.634 --> 00:05:01.187 So let's take a look at 00:05:01.187 --> 00:05:03.587 what it means to run a regression analysis with log transformation, 00:05:03.587 --> 00:05:05.241 and why would we want to do that? 00:05:06.960 --> 00:05:09.330 Transforming the variables to be less skewed 00:05:09.330 --> 00:05:13.384 is not the right reason to use log transformation 00:05:13.384 --> 00:05:15.870 and if you want to reduce skewness, 00:05:15.870 --> 00:05:18.000 you, of course, can do log transformation, 00:05:18.000 --> 00:05:19.680 but you have to understand that 00:05:19.680 --> 00:05:23.160 there are other more important reasons to use log transformation 00:05:23.160 --> 00:05:26.516 and it also influences how you interpret your results. 00:05:27.244 --> 00:05:31.170 So this is an example from the Prestige data set, 00:05:31.170 --> 00:05:36.210 these are occupations from the Canadian census of 1970-something.
00:05:36.424 --> 00:05:39.780 And we have the prestige score of an occupation 00:05:39.780 --> 00:05:42.150 and then the average income of an occupation. 00:05:42.150 --> 00:05:43.731 We're interested in learning 00:05:43.731 --> 00:05:46.260 how much income depends on prestige. 00:05:46.967 --> 00:05:49.680 We can see that there is a linear effect here, 00:05:49.680 --> 00:05:53.331 prestige goes from 20 to 80 and 00:05:53.331 --> 00:05:58.890 first income increases and then it starts to increase 00:05:58.890 --> 00:06:00.381 in a nonlinear fashion. 00:06:00.381 --> 00:06:03.120 So if we were to draw a line or a curve, 00:06:03.120 --> 00:06:07.350 it would first go flat and then it would curve up a bit. 00:06:08.014 --> 00:06:12.750 So the line here is not the best description of the data. 00:06:13.136 --> 00:06:17.383 We can see here that these observations are below the regression line, 00:06:17.383 --> 00:06:19.367 and these are above the regression line. 00:06:19.367 --> 00:06:21.561 So instead of fitting a line, 00:06:21.561 --> 00:06:24.287 fitting some kind of curve that bends up 00:06:24.287 --> 00:06:25.663 would be better, 00:06:25.663 --> 00:06:27.510 something like that. 00:06:27.510 --> 00:06:31.277 So instead of saying that these are characterized by a line, 00:06:31.277 --> 00:06:36.720 we say that these observations are characterized by this blue curve here. 00:06:37.191 --> 00:06:38.511 And that is 00:06:38.511 --> 00:06:40.680 what the log transformation does for us 00:06:40.680 --> 00:06:42.990 and it's the important reason why we use it. 00:06:43.354 --> 00:06:45.184 So instead of saying that 00:06:45.184 --> 00:06:50.820 income increases by a constant amount for each unit of prestige, 00:06:50.820 --> 00:06:53.220 we say that income increases 00:06:53.220 --> 00:06:56.340 relative to the current level of income, 00:06:56.340 --> 00:06:57.930 as a function of prestige.
00:06:59.258 --> 00:07:02.580 Let's take a log transformation of income and 00:07:02.580 --> 00:07:04.350 run a regression analysis. 00:07:04.907 --> 00:07:06.570 So here's my regression analysis. 00:07:06.570 --> 00:07:07.646 This is the income, 00:07:07.774 --> 00:07:10.084 done with R, using this data. 00:07:10.084 --> 00:07:12.660 We can see that a one-unit increase in prestige 00:07:12.660 --> 00:07:18.458 leads to 176 Canadian dollars more per year, 00:07:18.458 --> 00:07:21.840 and then when we have a log of income, 00:07:21.840 --> 00:07:26.640 then log of income increases by 0.03, 00:07:26.640 --> 00:07:29.700 for every additional unit of prestige. 00:07:30.942 --> 00:07:34.002 So looking at these, 00:07:34.002 --> 00:07:37.710 we note first that the log model has a slightly higher R-squared 00:07:37.710 --> 00:07:40.050 and also a slightly higher adjusted R-squared 00:07:40.050 --> 00:07:40.830 than the income model. 00:07:40.830 --> 00:07:42.553 So based on that metric, 00:07:42.553 --> 00:07:47.040 we can make an informal judgment that this could be a better model. 00:07:47.040 --> 00:07:51.360 It's not certain that a higher R-squared means that it's a better model, 00:07:51.360 --> 00:07:52.676 but it could be. 00:07:53.019 --> 00:07:56.623 How we judge models will come up in later videos. 00:07:57.973 --> 00:07:59.310 So how do we interpret? 00:07:59.310 --> 00:08:02.940 What does this 0.03 increase in the log of revenue, 00:08:02.940 --> 00:08:04.020 or rather the log of income, mean? 00:08:04.855 --> 00:08:08.400 For most people the metric of a log of income 00:08:08.400 --> 00:08:10.431 doesn't have any meaning. 00:08:10.838 --> 00:08:16.170 If someone tells me that the logarithm of your income will increase by 0.01, 00:08:16.256 --> 00:08:18.386 I know what it means because I've done this, 00:08:18.407 --> 00:08:20.417 I've read my statistics books, 00:08:20.417 --> 00:08:21.776 most people don't. 00:08:22.311 --> 00:08:24.681 So how do we interpret?
00:08:24.788 --> 00:08:30.235 There are two ways of interpreting the log transformation results. 00:08:30.235 --> 00:08:34.290 One is the general way of interpreting any nonlinear effects, 00:08:34.590 --> 00:08:35.910 and that is plotting. 00:08:36.081 --> 00:08:37.821 So what you can do, 00:08:37.821 --> 00:08:41.246 here are the regression results for the log transformation model. 00:08:41.246 --> 00:08:43.350 What we do here is that 00:08:43.350 --> 00:08:49.795 we calculate the fitted values of the logarithm of income based on prestige. 00:08:49.795 --> 00:08:53.091 So this is simply taking the formula, 00:08:53.091 --> 00:09:00.060 adding intercept 7.46 plus 0.02 times 20. 00:09:00.724 --> 00:09:03.784 So that provides us with the fitted log income. 00:09:03.784 --> 00:09:08.850 And the hat here denotes that this is a fitted value 00:09:08.850 --> 00:09:10.663 from the regression analysis. 00:09:10.663 --> 00:09:14.370 Then we take exponentials of these fitted log incomes. 00:09:14.734 --> 00:09:16.714 So when you take a logarithm of a number, 00:09:16.843 --> 00:09:19.003 you get another number. 00:09:19.003 --> 00:09:21.720 When you apply exponential to that other number, 00:09:21.720 --> 00:09:24.150 you get back your original number. 00:09:24.150 --> 00:09:28.440 So we say that the exponential is the inverse function of a logarithm, 00:09:28.440 --> 00:09:31.500 and logarithm is the inverse function of exponential. 00:09:31.500 --> 00:09:35.610 Because we can apply one to get back the original number 00:09:35.610 --> 00:09:39.094 that was used as an input for the other. 00:09:39.523 --> 00:09:44.460 So exponential transformation allows us to kind of undo the log transformation, 00:09:44.460 --> 00:09:48.583 and we get these predicted incomes 00:09:48.819 --> 00:09:51.373 for each prestige level.
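The two steps described above, computing the fitted log income and then exponentiating it, can be sketched as follows. The intercept 7.46 is the value quoted above; the slope 0.025 is the prestige coefficient quoted later in the video (the spoken example rounds it to 0.02), and the function name `predicted_income` is a made-up helper for illustration. This is a sketch of the arithmetic under those assumptions, not the output of the actual R analysis.

```python
import math

INTERCEPT = 7.46  # intercept of the log-income model, as quoted in the video
SLOPE = 0.025     # prestige coefficient, as quoted later in the video (assumed here)


def predicted_income(prestige):
    """Back-transform the fitted log income to dollars (hypothetical helper)."""
    fitted_log_income = INTERCEPT + SLOPE * prestige  # value on the log scale
    return math.exp(fitted_log_income)                # exp undoes the natural log


# Predicted incomes over the prestige range shown in the plot
for prestige in (20, 40, 60, 80):
    print(f"prestige {prestige}: predicted income {predicted_income(prestige):,.0f}")

# exp and log are inverse functions: applying one undoes the other
original = 5000.0
round_trip = math.exp(math.log(original))
print(original, round_trip)
```

Plotting these predicted incomes against prestige gives a curve that bends upward, because each additional unit of prestige multiplies the prediction by the same factor.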
00:09:51.373 --> 00:09:54.428 Then we plot the data, 00:09:54.428 --> 00:09:59.280 so we plot these exponentiated predicted logs of income here, 00:09:59.280 --> 00:10:01.590 and as a function of prestige, 00:10:01.590 --> 00:10:02.700 we get this curve. 00:10:03.578 --> 00:10:05.640 So whenever you don't know 00:10:05.640 --> 00:10:08.850 how to interpret a particular regression estimate 00:10:08.850 --> 00:10:14.400 that has been calculated based on some transformation, 00:10:14.828 --> 00:10:18.150 one very good way of doing that is to plot the effect. 00:10:18.150 --> 00:10:21.870 You can also plot the linear model effects only 00:10:21.870 --> 00:10:23.280 and then you can compare 00:10:23.280 --> 00:10:25.260 which one looks more reasonable. 00:10:25.538 --> 00:10:26.708 Here the blue curve, 00:10:26.751 --> 00:10:28.641 the log-transformed results, 00:10:28.641 --> 00:10:30.690 looks like a much more reasonable explanation for the data 00:10:30.690 --> 00:10:32.940 than the red line. 00:10:33.347 --> 00:10:35.447 So that is one way, 00:10:35.447 --> 00:10:36.403 the general way 00:10:36.403 --> 00:10:41.190 that you can interpret any nonlinear effects. 00:10:41.190 --> 00:10:43.620 And this kind of plot, where you draw a line, 00:10:43.620 --> 00:10:46.440 is called a marginal prediction plot. 00:10:46.440 --> 00:10:48.240 We will cover this later in the course. 00:10:49.718 --> 00:10:56.070 Another way of interpreting regression analysis results after log transformation 00:10:56.070 --> 00:10:57.930 is to interpret them directly. 00:10:58.380 --> 00:11:01.774 So log transformation is a special case of transformations, 00:11:01.774 --> 00:11:04.684 because it has a natural interpretation. 00:11:04.877 --> 00:11:07.680 These interpretations are given by Wooldridge's book here.
00:11:07.680 --> 00:11:13.295 So when we take the log of the dependent variable then 00:11:13.295 --> 00:11:15.490 each of these regression coefficients, 00:11:15.490 --> 00:11:17.388 here only for prestige, 00:11:17.860 --> 00:11:19.287 changes its meaning. 00:11:19.287 --> 00:11:24.760 So the meaning of this unit increase of prestige 00:11:24.760 --> 00:11:28.000 is actually translated to relative increase. 00:11:28.000 --> 00:11:30.970 So beta1 of prestige here 00:11:30.970 --> 00:11:34.735 doesn't tell us, for a unit increase of prestige, 00:11:34.735 --> 00:11:37.330 what the effect on income is. 00:11:37.673 --> 00:11:39.503 Instead, it tells 00:11:39.503 --> 00:11:43.891 what is the effect of a one-unit increase of prestige 00:11:44.577 --> 00:11:47.500 on the relative income. 00:11:47.500 --> 00:11:55.660 So if the regression coefficient of prestige is 0.025, 00:11:55.660 --> 00:11:57.130 like it is here, 00:11:57.323 --> 00:11:58.883 then it means that 00:11:58.883 --> 00:12:08.410 a one-unit increase in prestige leads to a 2.5 % increase in salary, 00:12:08.410 --> 00:12:10.360 compared to the current salary level. 00:12:10.681 --> 00:12:13.441 So it's an exponential growth model, 00:12:13.441 --> 00:12:15.910 that's why we use the exponential function. 00:12:16.381 --> 00:12:21.833 So every time your prestige of occupation increases by one, 00:12:21.833 --> 00:12:26.380 then your salary goes up 2.5 % compared to the previous level. 00:12:26.937 --> 00:12:31.720 Calculating how much, for example, ten units would mean 00:12:31.720 --> 00:12:32.890 could be a bit difficult, 00:12:32.890 --> 00:12:37.210 because we have to take 2.5 % 00:12:37.210 --> 00:12:40.210 and then apply that ten times. 00:12:40.210 --> 00:12:44.410 So it's 1.025 to the power of 10 00:12:44.410 --> 00:12:48.040 and then you will get the effect of a 10-unit increase of prestige.
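As a worked version of the ten-unit calculation, here is a small sketch using the coefficient 0.025 quoted above. Applying a 2.5 % increase ten times means multiplying by 1.025 ten times. The code also shows the exact multiplier implied by the log model, exp(0.025 × 10), and the naive reading of ten times 2.5 %, purely for comparison; the variable names are illustrative.

```python
import math

SLOPE = 0.025  # prestige coefficient in the log-income model, as quoted above
UNITS = 10     # size of the prestige increase we want to interpret

# Reading the coefficient as "2.5 % per unit" and compounding it ten times
compounded = (1 + SLOPE) ** UNITS

# Exact multiplier implied by the log model: exp(slope * units)
exact = math.exp(SLOPE * UNITS)

# Naive reading that ignores compounding: 10 times 2.5 % = 25 %
naive = 1 + SLOPE * UNITS

print(f"compounded 2.5 % ten times: income multiplied by {compounded:.4f}")
print(f"exact from the log model:   income multiplied by {exact:.4f}")
print(f"naive 10 x 2.5 %:           income multiplied by {naive:.4f}")
```

The compounded and exact figures both come out at roughly a 28 % increase, a little more than the naive 25 %, which is why a ten-unit effect is not simply ten times the one-unit effect.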
00:12:48.490 --> 00:12:54.078 In practice, your statistical software will  do the marginal effects calculations for you. 00:12:54.078 --> 00:12:57.970 So doing a plot like that would  simplify the interpretation, 00:12:57.970 --> 00:12:59.530 because you can see directly, 00:12:59.530 --> 00:13:04.900 what is the effect of moving from  prestige of 40 to prestige of 60 00:13:04.900 --> 00:13:06.258 by taking the line. 00:13:06.258 --> 00:13:10.270 Also, the software will give you  the numbers behind these plots. 00:13:10.677 --> 00:13:14.650 So that's how you calculate marginal effects. 00:13:14.650 --> 00:13:16.930 The actual calculation is  covered in a different video.
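The comparison mentioned above, moving from prestige 40 to prestige 60, can be sketched by hand under the same assumptions (intercept 7.46 and slope 0.025 as quoted in the video; `predicted_income` is a hypothetical helper, not a function of any statistics package). It illustrates why the plot helps: the same 20-unit step always multiplies income by the same factor, but the dollar difference depends on where the step starts.

```python
import math

INTERCEPT = 7.46  # as quoted in the video
SLOPE = 0.025     # as quoted in the video


def predicted_income(prestige):
    """Predicted income in dollars from the log-income model (hypothetical helper)."""
    return math.exp(INTERCEPT + SLOPE * prestige)


def effect_of_move(start, end):
    """Return (dollar difference, ratio) for moving between two prestige levels."""
    low, high = predicted_income(start), predicted_income(end)
    return high - low, high / low


diff_20_40, ratio_20_40 = effect_of_move(20, 40)
diff_40_60, ratio_40_60 = effect_of_move(40, 60)

print(f"20 -> 40: +{diff_20_40:,.0f} dollars, ratio {ratio_20_40:.3f}")
print(f"40 -> 60: +{diff_40_60:,.0f} dollars, ratio {ratio_40_60:.3f}")
```

Statistical software reports the same kind of numbers alongside its marginal prediction plots, so in practice this arithmetic rarely needs to be done by hand.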