We will now take a look at the interpretation of regression coefficients. The actual interpretation of what the results mean is a more difficult part of the work than the calculation of the results.

Whenever you run a regression analysis, the regression coefficients, the betas, have to be interpreted, because the readers of your research article do not know what the betas mean, so you have to tell them. There are also other ways that the results of a regression analysis can be quantified.

Regression analysis tells us the direction of an effect and whether the effect is statistically significant or not. What we want to know, however, is whether the effects are large, and that depends on interpretation. In some contexts a regression coefficient of 10 is very large; in other contexts a regression coefficient of 10 is very small. So you have to consider the context and the variables involved.

One of the easiest ways to start interpreting a regression analysis is to look at the R-squared statistic. The R-squared is calculated from the regression results and is typically presented at the bottom of the regression table. A related statistic is the adjusted R-squared. The R-squared tells us how much the independent variables together explain the dependent variable. It is, in some sense, an estimate of the quality of the model; it is sometimes referred to as the goodness of fit of the regression model, or as the coefficient of determination, but most people just refer to it as the R-squared.

The R-squared varies between 0 and 1. A value of 0 means that the independent variables do not explain the dependent variable at all, and a value of 1 means that the independent variables completely explain the dependent variable.

One problem with the R-squared is that it always goes up when you add variables to the model. When the number of variables starts to approach the number of observations, for example if you fit a model with 99 variables to 100 observations, the R-squared will be exactly 1.
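As an illustration of this point (not part of the lecture; the data here are simulated pure noise), a short Python sketch showing how the R-squared of an ordinary least squares fit climbs toward 1 as more and more noise predictors are added, reaching 1 when 99 predictors are fit to 100 observations:

import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)             # dependent variable: pure noise
X_full = rng.normal(size=(n, 99))  # 99 candidate predictors, also pure noise

def r_squared(y, X):
    # OLS fit with an intercept; R^2 = 1 - SS_residual / SS_total
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

for k in (1, 10, 50, 99):
    print(k, "predictors -> R^2 =", round(r_squared(y, X_full[:, :k]), 3))
# With 99 noise predictors and 100 observations the R^2 reaches essentially 1.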
So the R-squared always increases as you add variables, and it is positively biased. Bias here means that if we calculate the regression on sample data, the result can be expected to be larger than if we ran the same regression on the full population.

Because the R-squared is positively biased, the adjusted R-squared statistic has been introduced; it penalizes complex models. When your R-squared goes up just because you have too many variables in the model, the adjusted R-squared adjusts the value down to compensate for that bias. The adjustment is based on the number of variables and the sample size. When the sample size is large and you have a small number of variables, for example five independent variables and 500 observations, you have 100 observations for each independent variable and the adjustment is very small. If you instead have, say, 25 independent variables and 100 observations in your sample, the adjustment is pretty large, because you have only four observations for each independent variable.

One problem is that the adjusted R-squared is not unbiased either, but it can be expected to be less biased than the plain R-squared. Getting a truly unbiased estimate of the population R-squared is quite difficult, so we do not normally attempt it.
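For reference, the adjustment reported by most statistics packages is adjusted R-squared = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. A small sketch of how that penalty plays out in the two situations just described, assuming a raw R-squared of 0.30 (an arbitrary value, purely for illustration):

def adjusted_r2(r2, n, k):
    # Standard adjustment: penalize by the ratio of degrees of freedom
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.30  # assumed raw R-squared, just for illustration
print(adjusted_r2(r2, n=500, k=5))   # about 0.293 -> tiny adjustment
print(adjusted_r2(r2, n=100, k=25))  # about 0.064 -> large adjustment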
The R-squared tells us whether the model explains the data at all. When the R-squared is 0, that is the end of the interpretation: the independent variables do not explain the dependent variable at all. Then the question is, how much explanation is meaningful? If you explain 1% of a phenomenon, in some contexts that is meaningful and in other contexts it is not. The behavior of people and the performance of organizations are very difficult to predict or explain, because they depend on so many different things. Therefore, in the social sciences the R-squared typically varies in the 10-30% ballpark. If you have a 30% R-squared, you have a pretty good explanation, or you could also have a flawed study, but we will talk about that a bit later. So you have to consider the context: in the natural sciences, an R-squared of 99% could be considered not large enough.

The R-squared is useful as a first check of whether interpreting the results any further makes sense. If the R-squared is very small, then none of the variables in the model actually matter for the dependent variable, and interpreting the effect of each independent variable separately is a waste of time.

The R-squared also offers an intuitive way of explaining whether the results are large or not. If I tell you that the choice between three investment strategies explains 30% of the variation in your investment profits, that is a big deal, and we understand that 30% is a big deal in that context. Because the R-squared can be understood in percentages, it has a natural interpretation for most people.

Let us take a look at how Hekman and colleagues use the R-squared in their paper. They do not really interpret what the actual regression coefficients in their study mean; instead, they base their interpretation of the magnitude of the effects on the R-squared. They note that between the control-variables-only model and the model that adds the gender and race variables, the R-squared increases by 15 to 20 percent. That can be interpreted to mean that the effects of race and gender are in the ballpark of 15 to 20 percent of the variation, assuming that there is no bias in the R-squared, which is not true; they should really be looking at the adjusted R-squared in this case. But everyone understands that if we say that roughly one-fourth of the variation in the customer satisfaction scores is explained by gender and race, that is a big deal, at least everyone who understands percentages. So the R-squared provides us with an easy way of saying whether the results have any practical meaning.
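As an aside, the comparison Hekman and colleagues make is simply the difference in R-squared between a controls-only model and a model that also includes the variables of interest. A minimal sketch of that kind of comparison with statsmodels, using simulated data and made-up variable names rather than the actual study data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
controls = rng.normal(size=(n, 3))           # hypothetical control variables
focal = rng.binomial(1, 0.5, size=(n, 2))    # hypothetical gender/race dummies
y = (controls @ np.array([0.2, 0.1, 0.3])
     + focal @ np.array([0.5, 0.4])
     + rng.normal(size=n))

m_controls = sm.OLS(y, sm.add_constant(controls)).fit()
m_full = sm.OLS(y, sm.add_constant(np.column_stack([controls, focal]))).fit()

# The increase in R-squared attributable to the focal variables
print("delta R^2:", m_full.rsquared - m_controls.rsquared)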
Once you have looked at the R-squared, the next thing we want to know is which of the individual variables matter, and that is where interpretation of the regression coefficients comes in. Let us take a look at the Talouselämä 500 example.

We have a sample in which women-led companies are 4.7 percentage points more profitable than men-led companies, which is a big difference in return on assets. We want to know whether the difference is caused by having a woman CEO or by some third factor, so we have to present alternative, competing hypotheses. One competing hypothesis is that this is not an effect of CEO gender at all but a spurious correlation caused by firm revenue: smaller companies are more likely to hire women, and smaller companies are also more profitable. A second competing hypothesis is that this is an industry difference: manufacturing companies, for example, are less profitable on the ROA metric, because ROA depends on assets and manufacturing companies tend to have more assets than service companies, and manufacturing companies are also more likely to hire male CEOs than women CEOs. So we have these other variables in play.

Regression analysis tells us the effect of CEO gender ceteris paribus, which is an economics term for holding the other variables constant. When CEO gender changes from zero, indicating a man, to one, indicating a woman, what is the expected increase in return on assets? Holding things constant means that you are comparing two cases that are exactly comparable on the other variables: if we have two companies of the same size and in the same industry, then the woman-led company is on average beta-1 more profitable. So the regression coefficient directly tells us the profitability difference. Whether it is 1, 2, or 3 percentage points, it is up to us to interpret whether that is a big effect. We know that 4.7 percentage points is a big difference; one percentage point is probably not such a big difference.

Interpreting regression coefficients is relatively straightforward when the variables have a meaningful unit. ROA has a meaningful unit for managers: if we told a manager that my company's ROA is 20%, they would know that that is pretty good in most industries.
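To make the ceteris paribus idea concrete, here is a minimal sketch in Python, using simulated data and made-up numbers rather than the actual Talouselämä 500 data: ROA is regressed on a CEO-gender dummy while controlling for firm size and an industry dummy, and the coefficient on the dummy is then the expected ROA difference between two firms of the same size and industry.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
log_revenue = rng.normal(18, 2, size=n)       # firm size (log of revenue), assumed
manufacturing = rng.binomial(1, 0.4, size=n)  # industry dummy, assumed
# Smaller firms are assumed more likely to have a woman CEO (spurious-correlation setup)
p_woman = 1 / (1 + np.exp(0.5 * (log_revenue - 18)))
woman_ceo = rng.binomial(1, p_woman)
roa = (5 + 2.0 * woman_ceo - 0.8 * (log_revenue - 18)
       - 1.5 * manufacturing + rng.normal(0, 4, size=n))

X = sm.add_constant(np.column_stack([woman_ceo, log_revenue, manufacturing]))
fit = sm.OLS(roa, X).fit()
# params order: intercept, woman_ceo, log_revenue, manufacturing.
# The woman_ceo coefficient is the ROA difference, in percentage points,
# holding firm size and industry constant (ceteris paribus).
print(fit.params)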
The CEO gender variable also has a meaning for us: 1 means the CEO is a woman, 0 means the CEO is a man. Sometimes, however, we have units that do not really have any meaning, and that complicates the interpretation.

Let us take a look at this question: does a one-unit increase in education pay off? We have a statement, a regression result, that a one-unit increase in education leads to a one-unit increase in salary. Is that a big deal? We would need to know what the unit of education is and what the unit of salary is. Let us say that education is measured in years and salary in euros per year.

So we say that a one-year increase in education leads to a one-euro increase in annual salary. Does that make a difference? I would think not, for most people; hardly anyone wants to go to school for an extra year just to get one additional euro of income per year. In that sense it is not meaningful.

How about a one-year increase in education leading to a 1,000-euro increase in annual salary? That is a more difficult question. If we consider Finland, where annual salaries are in the tens of thousands of euros, then at the lower end, if you make 20 thousand per year, 1,000 euros is 5% of your salary; maybe that is worth one year of education, maybe not, depending on how much you like going to school. On the other hand, if these data were from a developing country where annual salaries are in the one to two thousand euro ballpark, then a 1,000-euro increase in annual salary is a big deal: in some cases you can basically double your income by going to school for one additional year, and that is a big thing for those people. So you have to think about the unit of the independent variable, the unit of the dependent variable, and the context in which you are evaluating the effect.

What if we say that a one-year increase in education leads to a one-Bitcoin increase in annual salary? We get one additional year of education, and we get one Bitcoin per year more.
Well, that is more problematic, because people do not have an intuitive understanding of what the value of a Bitcoin is. If you tell somebody, "I will give you a Bitcoin", the first question they will ask is: what is the value of a Bitcoin in euros? In this case we can convert the value of a Bitcoin into euros and express the regression coefficient in a way that is more understandable. Let us say that a one-year increase in education leads to a three-thousand-euro increase in annual salary; I do not know what the value of a Bitcoin is right now, but let us assume it is three thousand euros. Then we know that this is probably a big deal for some people. So sometimes we can convert the units into something that we understand, even if the original unit was something that we do not understand easily.

What if we have a unit that cannot be converted? Let us say the result is that a one-year increase in education leads to a one-Buckazoid increase in annual salary. The Buckazoid is a fictional currency from a computer game, and I do not think anyone has ever worked out an exchange rate from Buckazoids to euros, so we cannot convert this effect into euros. What do we do?

One way of dealing with this Buckazoid issue is to first understand what the average salary in Buckazoids is in this fictional universe, and also how dispersed the salaries are. Saying "I will give you ten Buckazoids" or "I will give you a million Buckazoids" does not really mean anything unless we know the mean income. If the mean income in that fictional world is ten Buckazoids and we tell somebody they will get a million Buckazoids, then a million Buckazoids is probably a lot. If we give them a million Buckazoids and the annual income is a billion Buckazoids, then it is not such a big deal. To understand how a variable varies, we have to look at its mean and standard deviation. That is useful when we have variables that do not have naturally interpretable units: we look at how the variable is distributed.
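Both of these tricks can be written down in a couple of lines. A small sketch with made-up numbers (the exchange rate and the Buckazoid figures are assumptions, purely for illustration):

# Converting the unit of the outcome rescales the coefficient linearly:
beta_btc_per_year = 1.0   # +1 Bitcoin of annual salary per extra year of education
eur_per_btc = 3000.0      # assumed exchange rate, purely illustrative
beta_eur_per_year = beta_btc_per_year * eur_per_btc  # -> 3000 euros per extra year

# When no exchange rate exists (Buckazoids), relate the effect to the outcome's spread:
mean_salary_bz = 10.0     # assumed mean salary in Buckazoids
sd_salary_bz = 4.0        # assumed standard deviation in Buckazoids
beta_bz_per_year = 1.0
effect_in_sd_units = beta_bz_per_year / sd_salary_bz  # 0.25 standard deviations per year
print(beta_eur_per_year, effect_in_sd_units)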
So we take a look at the mean and the standard deviation. Let us assume that in our sample the income in Buckazoids is normally distributed. A normal distribution implies that being one or two standard deviations from the mean has a special interpretation. In a normal distribution, 68% of the observations fall within plus or minus one standard deviation of the mean. So if we say that our income is one standard deviation above the mean, we know that we are solidly in the high-income segment, well above the average. If we say that our income, in Buckazoids, is two standard deviations above the mean, then we know that we are in the top 2.5% of the income distribution. We can also see that, generally, the effect of a one-standard-deviation increase is pretty big: if you start solidly below the mean, one standard deviation takes you to about the average, and at two standard deviations you are pretty rich, in the top 2.5%.

So standard deviation units can be useful for interpreting regression results. If we say that one additional year of education increases your income by one standard deviation in Buckazoid units, is that a large effect? For people it would be, but we would also have to think about the lifespan of these aliens: if they live on average only one year, then a one-year investment in education is a huge deal for them. So we have to think about the context again.

Let us take a look at an empirical example. This is Deephouse's paper, specifically Table 2 and Model 2 of the regression results, and we will interpret the estimates purely through standard deviations. The dependent variable, relative ROA, has a meaningful unit, but we will just ignore it for now and look only at standard deviations. The regression coefficient was -0.02 for the effect of strategic deviation on relative return on assets. Is that a big effect? To understand that, we would need to know what the unit of strategic deviation is; it is a constructed measure made up by the authors, so it does not have a meaning in itself. ROA has a meaning, but we will ignore it for now.
We need to know the standard deviations of these variables. The standard deviation of relative ROA is 0.7, and the standard deviation of strategic deviation is about 2.9. That tells us that, if the data are normally distributed, 95% of the ROA observations are within plus or minus 1.4 units of the mean, that is, two standard deviations. The difference between the top 2.5% and the bottom 2.5% is then about 2.8 units: from the bottom 2.5% to the top 2.5% is four standard deviations, or 2.8 units.

So what is the effect of strategic deviation? A one-standard-deviation increase in strategic deviation is 2.932 multiplied by the coefficient -0.020, which equals roughly a 0.058 decrease in relative ROA. Then we compare: is this -0.058 large relative to the 2.8 units? The full scale, from the worst 2.5% to the best 2.5% of ROA, is 2.8 units, and if you increase your strategic deviation by one standard deviation, you get about a 0.058 decrease in ROA. So it is a smallish effect.
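Written out as a small sketch, using the numbers quoted above and assuming, as in the lecture, that ROA is approximately normally distributed:

sd_strategic_deviation = 2.932
sd_roa = 0.7
beta = -0.020  # coefficient of strategic deviation on relative ROA

# Effect of a one-standard-deviation increase in strategic deviation:
effect_1sd = sd_strategic_deviation * beta  # about -0.06

# Under approximate normality, about 95% of ROA values lie within +/- 2 SD of the mean,
# so the span from the bottom 2.5% to the top 2.5% is about 4 standard deviations:
roa_span = 4 * sd_roa  # 2.8 units

print(effect_1sd, roa_span, abs(effect_1sd) / roa_span)  # roughly 2% of the full span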
We can also get a feel for how effect sizes should be interpreted and reported by looking at a nice example about the sauna, which is a Finnish thing. Suppose we ask whether the sauna is warm. A typical research paper would answer that the temperature of the sauna is statistically significantly different from normal room temperature. That tells us that maybe the sauna is heating up, maybe it is ready to go in, maybe it is too hot, maybe it was on the day before and is still cooling down; it does not really tell us anything about whether the sauna is warm. That is the equivalent of saying that the effect of strategic deviation on ROA is negative and statistically significantly different from zero. Statistical significance just tells us that there is some effect; it does not tell us whether the effect is large.

A slightly better answer is that the temperature of the sauna is currently 80 degrees; the comparable statement is that the effect of strategic deviation on ROA is -0.020. That is useful for people who understand what 80 degrees means and what -0.020 means. Most people who go to the sauna often know what 80 degrees centigrade means, but you cannot assume that the readers of your research study will understand your units, so you have to explain what they mean. A really good answer to whether the sauna is hot is to say that the temperature is currently 80 degrees, and then add that most people who go to the sauna regularly would say that it is too hot, but that they could still go in. That quantifies how hot the sauna is better than just saying that it is 80 degrees centigrade.

The same applies to the regression result. You can say that the effect of strategic deviation on ROA is -0.020, and that the difference in predicted ROA between the least deviant and the most deviant firms is about 0.12, while the corresponding range of ROA itself is about 2.8. We can see that 0.12 is pretty small compared to 2.8, so the effect is quite small: there are other, more effective things you can do to improve your profitability than changing how strategically deviant your company is.

Let us take a look at yet another example, this time from Hekman's paper. Hekman's paper shows a regression table in which the coefficient for the number of patients in a physician's panel, that is, how many people come to see the doctor, is -0.04, and the coefficient for the age of the doctor is -0.13. Are these large effects or not? Normally we would have to look at the correlation table, the standard deviations, and the means to judge that. But this is not the normal case, because these are standardized regression coefficients. The authors do not report it, but you can see it by comparison: if you tried to interpret these as raw effects of the number of patients in the panel, which is in the thousands, and of age, which is in the tens, the effect sizes would not make any sense. Also, all of these effects lie between plus and minus one, which is the typical range for standardized regression coefficients; they can fall outside that range, but they are typically zero point something or minus zero point something. So these are standardized coefficients, which means that the data have been standardized.
Standardized means that every variable has been scaled to a standard deviation of 1 and a mean of 0 before the regression is estimated. In that case we interpret the estimates directly in standard deviation units: a one-standard-deviation increase in physician productivity is associated with a beta-1 standard deviation increase in patient satisfaction.

This looks like the way to do it always; it would seem to simplify life to always use standardized estimates, but that is actually not the case. I recommend that you never standardize a variable that has a meaningful scale. If you have euros or years or something else that makes sense to people as a unit, do not standardize. The reason is that standardized estimates depend on the scale of the variables, because the standard deviation used is the sample standard deviation. Let us say that here the standard deviation of the doctors' age is 6.58 and the mean is 50.34, so the doctors are quite old. What would happen if the doctors in this sample were instead newly graduated, between 24 and 28 years old, with a standard deviation of 1? The standardized regression coefficient for the exact same effect would be only about -0.02, which has a very different interpretation from -0.14; it is seven times smaller, yet it is the exact same effect, just scaled differently. This differential scaling means that the two estimates, -0.02 and -0.14, are not comparable, so standardization does not make your results comparable across samples.

If you can interpret the results without standardization, it is always better to do so. As a rule of thumb, use standardization only if none of your variables has a natural scale; otherwise, interpret in standard deviation units only those variables for which a natural scale does not exist.
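Finally, a small sketch of why standardized coefficients from different samples are not comparable. The data are simulated and the numbers (an underlying effect of -0.02 per year of age, age spreads of 6.58 and 1) are chosen only to mirror the example above:

import numpy as np

rng = np.random.default_rng(3)

def standardized_slope(age_sd, true_slope=-0.02, n=5000):
    # Same underlying effect (true_slope per year of age) in every sample;
    # only the spread of age differs between samples.
    age = rng.normal(50, age_sd, size=n)
    y = true_slope * age + rng.normal(0, 1.0, size=n)
    # Standardize both variables (mean 0, SD 1), then fit a simple regression:
    a = (age - age.mean()) / age.std()
    b = (y - y.mean()) / y.std()
    return np.polyfit(a, b, 1)[0]  # slope of the standardized regression

print(standardized_slope(age_sd=6.58))  # wide age range -> standardized beta around -0.13
print(standardized_slope(age_sd=1.0))   # narrow age range -> around -0.02, same true effect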