WEBVTT 00:00:00.060 --> 00:00:04.170 Because experiments are not always feasible in business research, 00:00:04.170 --> 00:00:08.700 we use statistical controls for alternative explanations. 00:00:08.700 --> 00:00:12.000 So that's our second strategy for making causal claims. 00:00:12.000 --> 00:00:15.240 The idea of statistical controls is to 00:00:15.240 --> 00:00:19.560 introduce plausible alternative explanations to your analysis. 00:00:19.560 --> 00:00:24.960 So instead of just comparing men-led companies against women-led companies, 00:00:24.960 --> 00:00:29.490 we introduce possible confounding factors to the analysis. 00:00:29.490 --> 00:00:36.060 For example, we could say that this difference between men-led and women-led companies, 00:00:36.060 --> 00:00:39.840 these are just arbitrary values here, 00:00:39.840 --> 00:00:44.280 this difference here is not because of the gender differences, 00:00:44.280 --> 00:00:50.040 instead, we could claim that it's a company size difference. 00:00:50.040 --> 00:00:56.010 So this overlap between gender and performance here, 00:00:56.010 --> 00:01:00.000 the correlation, is partially caused by gender, 00:01:00.000 --> 00:01:03.480 but it's partially also because 00:01:03.480 --> 00:01:08.370 smaller companies are more likely to hire women as CEOs, 00:01:08.370 --> 00:01:11.130 and smaller companies are more profitable. 00:01:11.130 --> 00:01:14.670 So we say that this relationship between gender and performance 00:01:14.670 --> 00:01:22.440 is at least partially explained by size being a factor in CEO selection, 00:01:22.440 --> 00:01:25.769 and size being a factor in influencing performance. 00:01:26.880 --> 00:01:33.000 So, how do we take that kind of control variable into account? 00:01:33.000 --> 00:01:35.130 There are a couple of different ways.
00:01:35.130 --> 00:01:41.190 One intuitive way is an instance of a general strategy called matching, 00:01:41.190 --> 00:01:44.160 so we try to make the samples more comparable. 00:01:44.160 --> 00:01:52.560 So let's assume that there are only a few women-led companies with more than 250 people, 00:01:52.560 --> 00:01:57.390 and most women-led companies have 250 people or fewer. 00:01:57.390 --> 00:02:02.310 We could make the samples more comparable by dropping large companies. 00:02:02.310 --> 00:02:07.350 So we only focus on medium-sized companies with 250 people or fewer. 00:02:07.350 --> 00:02:10.650 And that would be a fairer comparison. 00:02:10.650 --> 00:02:19.210 And if size actually is a factor that influences both CEO selection and performance, 00:02:19.210 --> 00:02:26.200 then these kinds of more comparable samples should give us a smaller performance difference, 00:02:26.200 --> 00:02:28.630 which they do in this case. 00:02:28.630 --> 00:02:31.780 So we can see that when we make the samples more comparable, 00:02:31.780 --> 00:02:36.370 the difference is 1.4 instead of 4.7. 00:02:37.303 --> 00:02:42.550 Matching is an intuitive way of understanding statistical controlling, 00:02:42.550 --> 00:02:45.490 but it's not a practical strategy for a couple of reasons. 00:02:45.490 --> 00:02:50.260 First of all, when you have multiple different things that you want to control for, 00:02:50.260 --> 00:02:53.830 then constructing this kind of matched sample 00:02:53.830 --> 00:02:55.870 with this kind of simple strategy 00:02:56.770 --> 00:02:58.720 is not a viable option anymore, 00:02:58.720 --> 00:03:03.580 because you cannot have exactly the same kinds of companies in both samples. 00:03:03.580 --> 00:03:07.660 So once the number of factors to be controlled increases, 00:03:08.380 --> 00:03:14.980 it's not possible to construct two samples that are comparable on all those factors.
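The matching idea above, dropping large companies to make the two groups more comparable, can be sketched with simulated data. Everything in this sketch is an assumption made for illustration: the size distribution, the share of women CEOs, the ROA formula and the variable names are hypothetical and are not the data behind the lecture's 4.7 and 1.4 figures. In this simulation the whole raw gap comes from company size, so restricting the sample should shrink it towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: company size in employees, skewed towards small firms.
size = rng.lognormal(mean=5.0, sigma=1.0, size=n)

# Assumption: smaller companies are more likely to have a woman as CEO.
p_female = np.where(size <= 250, 0.30, 0.05)
female = rng.random(n) < p_female

# Assumption: ROA falls with size and has NO direct gender effect.
roa = 10.0 - 1.5 * np.log(size) + rng.normal(0.0, 2.0, size=n)


def mean_difference(roa, female):
    """Difference in mean ROA: women-led minus men-led companies."""
    return roa[female].mean() - roa[~female].mean()


# Naive comparison over all companies.
raw_diff = mean_difference(roa, female)

# Crude matching: keep only companies with 250 people or fewer.
small = size <= 250
matched_diff = mean_difference(roa[small], female[small])

print(f"all companies:       {raw_diff:.2f}")
print(f"250 people or fewer: {matched_diff:.2f}")
```

With only one factor a single cutoff is enough. With several factors, each extra restriction removes more companies, which is the practical limit of matching described above.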
00:03:14.980 --> 00:03:17.440 So to take that into consideration, 00:03:17.440 --> 00:03:20.140 we don't normally apply matching; 00:03:20.140 --> 00:03:22.930 instead, we apply a statistical model. 00:03:22.930 --> 00:03:32.020 So we say that return on assets depends on CEO gender and company size, 00:03:32.020 --> 00:03:39.100 so that we can express return on assets as a linear function of CEO gender and size, 00:03:39.100 --> 00:03:40.943 so we multiply CEO gender by a coefficient, beta 1, 00:03:40.943 --> 00:03:42.867 where female is one and male is zero, 00:03:42.867 --> 00:03:46.930 and company size, we multiply that by another coefficient, beta 2, 00:03:46.930 --> 00:03:56.380 and then we ask the computer to give us some estimates for these beta 0, beta 1 and beta 2, 00:03:56.380 --> 00:04:00.190 so that we can predict the return on assets as well as possible. 00:04:00.190 --> 00:04:02.290 And a computer will do that for us, 00:04:02.290 --> 00:04:04.900 then we interpret the results to see 00:04:04.900 --> 00:04:07.660 whether the gender effect actually exists. 00:04:08.553 --> 00:04:13.960 Either way, regardless of how we actually implement this statistical controlling, 00:04:13.960 --> 00:04:18.130 we need to decide which factors we need to control for. 00:04:18.130 --> 00:04:23.170 And the factors that we control for are called control variables. 00:04:23.170 --> 00:04:29.560 Control variables are present in nearly every study in business research. 00:04:29.560 --> 00:04:34.420 Quite often you actually see a section in the paper 00:04:34.420 --> 00:04:39.550 that is explicitly labelled as control variables, like here in Hekman's paper, 00:04:39.550 --> 00:04:41.350 that we use as an example. 00:04:41.350 --> 00:04:47.620 So control variables are alternative explanations or alternative theories for the data.
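Written out, the linear model described above could look like the following. The lecture names only the coefficients beta 0, beta 1 and beta 2; the index i for companies, the variable labels and the error term are notation assumed here for illustration.

```latex
\mathrm{ROA}_i = \beta_0 + \beta_1 \,\mathrm{Female}_i + \beta_2 \,\mathrm{Size}_i + \varepsilon_i,
\qquad
\mathrm{Female}_i =
\begin{cases}
1 & \text{if the CEO of company } i \text{ is a woman} \\
0 & \text{if the CEO of company } i \text{ is a man}
\end{cases}
```

In this form beta 1 is the expected difference in return on assets between a women-led and a men-led company of the same size, which is the sense in which size is controlled for.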
00:04:47.620 --> 00:04:53.380 If we say that the women-led companies are more profitable than the men-led companies, 00:04:53.380 --> 00:04:57.126 we have to think really hard, why is that the case? 00:04:57.126 --> 00:05:01.660 Then we have all kinds of plausible reasons that we could come up with: 00:05:01.660 --> 00:05:05.890 the size effect, industry effect, selection effect, 00:05:05.890 --> 00:05:09.218 and then we include those in the same model. 00:05:09.555 --> 00:05:12.400 So we say that we have an independent variable of interest, 00:05:12.400 --> 00:05:16.240 that we assume to influence the dependent variable, 00:05:16.240 --> 00:05:19.660 and we also have control variables in the same model 00:05:19.660 --> 00:05:25.030 and we kind of put these variables together to compete against one another to see 00:05:25.030 --> 00:05:29.020 which one of them actually explains the dependent variable, 00:05:29.020 --> 00:05:30.610 return on assets in this case. 00:05:31.523 --> 00:05:38.560 So it's important that the control variables are selected based on theory, 00:05:38.560 --> 00:05:42.760 instead of just throwing in a standard set of gender and age 00:05:42.760 --> 00:05:45.850 if we have people, or industry and revenue if we have companies. 00:05:45.850 --> 00:05:51.820 So you need to choose them carefully to rule out alternative explanations 00:05:51.820 --> 00:05:55.240 and it's important that you justify 00:05:55.240 --> 00:05:59.620 why you think that the control variable is related to both 00:05:59.620 --> 00:06:02.320 your independent variable and the dependent variable.
00:06:02.320 --> 00:06:07.510 One common thing that I see in articles, and which I complain about as a reviewer, 00:06:07.510 --> 00:06:10.510 is that the authors generally only justify the 00:06:10.510 --> 00:06:13.420 relations between the control and the dependent variable, 00:06:13.420 --> 00:06:17.230 but it's almost as important that you justify 00:06:17.230 --> 00:06:21.610 why you think that the control and the interesting independent variable, 00:06:21.610 --> 00:06:23.170 CEO gender in this case, 00:06:23.170 --> 00:06:24.880 are correlated. 00:06:24.880 --> 00:06:28.346 Let's take a look at an example, 00:06:28.723 --> 00:06:35.440 so we have the article by Deephouse and they have a variable called market share. 00:06:35.440 --> 00:06:40.160 So is market share a good control variable, 00:06:40.160 --> 00:06:42.800 based on this correlation matrix? 00:06:44.960 --> 00:06:47.900 To understand whether it's a good control variable empirically, 00:06:47.900 --> 00:06:50.330 we have to look at certain correlations, 00:06:50.330 --> 00:06:58.250 so market share is a relevant control variable if it's correlated with the key independent variable and also related to the dependent variable, 00:06:58.250 --> 00:07:05.450 and we are looking at the effects of strategic deviation, variable number four, 00:07:05.450 --> 00:07:08.870 on relative return on assets, variable number one. 00:07:08.870 --> 00:07:16.520 So we need to take a look at the correlations of market share with variable one and variable four. 00:07:16.520 --> 00:07:17.630 So we are here, 00:07:17.630 --> 00:07:25.850 market share is weakly and negatively correlated with return on assets 00:07:25.850 --> 00:07:30.320 and it's very strongly correlated with strategic deviation. 00:07:30.320 --> 00:07:34.760 Now, we can't infer 00:07:34.760 --> 00:07:38.480 whether there is or is not a causal relationship based on a correlation.
00:07:38.480 --> 00:07:44.390 But this strong correlation raises the concern that, 00:07:44.390 --> 00:07:48.766 if market share has an effect on ROA, 00:07:48.766 --> 00:07:55.580 then because it's correlated with the strategic deviation variable, 00:07:55.580 --> 00:07:58.040 it could create a spurious correlation. 00:07:58.496 --> 00:08:01.340 So market share is relevant to control 00:08:01.340 --> 00:08:05.840 if we have theoretical reasons that return on assets depends on market share. 00:08:08.003 --> 00:08:10.496 Let's take a look at the actual modelling results. 00:08:11.250 --> 00:08:14.090 So this is based on Deephouse's paper, 00:08:14.090 --> 00:08:18.740 and they say that market share has a negative effect on return on assets, 00:08:18.740 --> 00:08:20.840 so that when your market share goes up, 00:08:20.840 --> 00:08:22.790 return on assets goes down. 00:08:22.790 --> 00:08:26.510 And compared to the other effects in the article, 00:08:26.510 --> 00:08:29.420 this is a very large effect. 00:08:29.420 --> 00:08:34.430 The effect of strategic deviation is -0.02 so it's small, 00:08:34.430 --> 00:08:38.120 you can't compare them directly but we will do that for convenience now. 00:08:38.120 --> 00:08:42.650 And they are highly correlated, so what will happen, 00:08:42.650 --> 00:08:47.930 what is the interpretation of this figure? 00:08:47.930 --> 00:08:49.370 The interpretation is that 00:08:49.370 --> 00:08:53.870 larger firms, firms with more market share, 00:08:53.870 --> 00:08:56.714 are more strategically deviant, 00:08:56.714 --> 00:08:58.378 according to their definition. 00:08:58.894 --> 00:09:02.123 Larger firms are also less profitable 00:09:02.123 --> 00:09:06.896 and these two relationships cause a spurious relationship.
00:09:07.670 --> 00:09:14.267 If larger firms are more deviant and larger firms have smaller ROA, 00:09:14.267 --> 00:09:19.280 it means that if this effect was not controlled for, 00:09:19.280 --> 00:09:23.060 then we would get a very different estimate for strategic deviation. 00:09:24.291 --> 00:09:27.216 If we don't control for market share, 00:09:27.230 --> 00:09:29.660 then this effect here will be inflated, 00:09:29.660 --> 00:09:34.760 because it confounds the effects of market share and strategic deviation. 00:09:34.760 --> 00:09:37.667 So let's assume we leave market share out, 00:09:38.104 --> 00:09:42.230 then our estimate of strategic deviation would be 00:09:42.230 --> 00:09:47.900 the actual direct effect of strategic deviation and also the effect of size, 00:09:47.900 --> 00:09:50.420 because size is correlated with deviation. 00:09:50.618 --> 00:09:56.846 So the effect would be -0.058, or about three times as large as before. 00:09:57.779 --> 00:10:01.610 So omitting the important control variable would have 00:10:01.610 --> 00:10:04.400 a serious consequence for the modelling results. 00:10:04.400 --> 00:10:08.126 And in this case, it would result in omitted variable bias, 00:10:08.603 --> 00:10:13.699 which makes the estimate about three times as large as it otherwise would be, 00:10:14.572 --> 00:10:17.360 assuming that the model is otherwise correctly specified. 00:10:17.360 --> 00:10:20.270 So when dealing with controls, 00:10:20.270 --> 00:10:25.010 because the controls are so important for your causal claims, 00:10:25.010 --> 00:10:27.470 you should take very seriously 00:10:27.470 --> 00:10:29.930 which variables you include and really think 00:10:29.930 --> 00:10:35.870 what kind of alternative explanations there are for the observed associations, 00:10:35.870 --> 00:10:38.960 or the association that you expect to observe. 00:10:41.638 --> 00:10:46.700 Statistical controls and experimental approaches can be compared.
00:10:46.700 --> 00:10:48.860 So in experiments, 00:10:48.860 --> 00:10:53.150 you have treatment and control groups that you assign yourself, 00:10:53.150 --> 00:10:54.290 and you apply the treatment, 00:10:54.290 --> 00:10:56.346 so you have full control over the study. 00:10:56.346 --> 00:10:59.990 And the groups are comparable to start with, at least on average, 00:10:59.990 --> 00:11:01.568 because of randomization, 00:11:01.568 --> 00:11:06.680 and if after treatment there is a difference between the two groups, 00:11:07.370 --> 00:11:11.750 then we can make a claim that the difference is because of the treatment, 00:11:11.750 --> 00:11:13.558 so that's fairly simple. 00:11:14.133 --> 00:11:18.740 With statistical controls, we don't have control over the cases, 00:11:18.740 --> 00:11:21.200 so we are just passive observers of what happens. 00:11:21.894 --> 00:11:26.420 And the only way we can rule out alternative explanations is to 00:11:26.420 --> 00:11:28.550 think, based on existing theory, 00:11:28.550 --> 00:11:34.460 what kind of other plausible explanations there are for an association, 00:11:34.460 --> 00:11:38.600 and then we rule them out using control variables in our analysis.
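The omitted variable bias discussed earlier, where leaving out market share inflates the estimate for strategic deviation, can be sketched with a small simulation. All numbers and names here are assumptions chosen for illustration, not values from Deephouse's article, and the `ols` helper is a hypothetical convenience function built on NumPy's least-squares solver. The sketch fits the model with and without the control and checks that the estimate without the control equals the controlled estimate plus the market share effect times the slope of market share on deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical data, loosely following the lecture's story.
market_share = rng.normal(0.0, 1.0, size=n)
# Assumption: firms with more market share are more strategically deviant.
deviation = 0.8 * market_share + rng.normal(0.0, 0.6, size=n)
# Assumption: both lower ROA; the direct effect of deviation is small.
roa = -0.02 * deviation - 0.05 * market_share + rng.normal(0.0, 0.1, size=n)


def ols(y, *columns):
    """Least-squares coefficients; the first one is the intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Long model: market share is controlled for.
_, b_dev_long, b_share_long = ols(roa, deviation, market_share)
# Short model: market share is omitted.
_, b_dev_short = ols(roa, deviation)
# Auxiliary regression: how the omitted variable moves with deviation.
_, delta = ols(market_share, deviation)

print(f"deviation effect, market share controlled: {b_dev_long:.3f}")
print(f"deviation effect, market share omitted:    {b_dev_short:.3f}")
print(f"controlled + share effect * delta:         {b_dev_long + b_share_long * delta:.3f}")
```

Read with the lecture's numbers, the same decomposition would be -0.058 = -0.02 + (-0.038): of the estimate obtained without the control, -0.038 would belong to market share rather than to strategic deviation.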