WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:04.050 Normally when we take a sum of different indicators or a mean of different indicators 00:00:04.050 --> 00:00:08.190 we assume that those indicators are unidimensional measures of 00:00:08.190 --> 00:00:14.370 one thing or one concept, and the purpose of taking a sum is to create a more reliable 00:00:14.370 --> 00:00:18.990 composite measure than any of the individual components that go into the composite. 00:00:18.990 --> 00:00:22.560 There are also other reasons to take sums or 00:00:22.560 --> 00:00:27.210 weighted sums or means of indicators, and that is called index construction. 00:00:27.210 --> 00:00:34.290 When and why you would want to use indices instead of raw variables is the topic of this video. 00:00:34.290 --> 00:00:37.920 To understand indices we first have to understand 00:00:37.920 --> 00:00:40.590 what is a scale variable and what is a non-scale variable. 00:00:40.590 --> 00:00:46.260 By a scale variable I mean a variable that is part of a measurement scale. For example, these 00:00:46.260 --> 00:00:53.460 three questions here form a survey scale that is supposed to measure a company's innovativeness. This 00:00:53.460 --> 00:00:58.440 is a unidimensional scale, or at least it's supposed to be a unidimensional scale, and 00:00:58.440 --> 00:01:04.230 what that means is that the items measure the same quantity and the items are highly correlated. 00:01:04.230 --> 00:01:10.860 So we expect a highly innovative company to score highly on all these items here. 00:01:10.860 --> 00:01:16.530 And there is also another assumption, or rather an implication of these assumptions: 00:01:16.530 --> 00:01:21.660 if these items were perfectly reliable, then the 00:01:21.660 --> 00:01:27.150 normal way we model these items would imply that the items are perfectly correlated.
00:01:27.150 --> 00:01:32.940 In other words, two scale variables are assumed to be less than perfectly correlated 00:01:32.940 --> 00:01:37.500 only because they are unreliable. This can of course be relaxed by using more 00:01:37.500 --> 00:01:41.460 advanced modeling techniques, but this is the common factor analysis assumption. 00:01:41.460 --> 00:01:45.900 There are of course variables that 00:01:45.900 --> 00:01:51.090 don't follow these assumptions. I call them non-scale variables. So all 00:01:51.090 --> 00:01:55.140 other variables that are not scale variables are non-scale variables. 00:01:55.140 --> 00:02:00.420 These items could be measuring distinct quantities, for example a person's height 00:02:00.420 --> 00:02:05.730 or weight. And the items may correlate highly or they may not correlate highly. 00:02:05.730 --> 00:02:12.540 A typical example of useful non-scale variables, where we would make an index, is alcohol 00:02:12.540 --> 00:02:17.380 consumption. Alcohol consumption is the sum of the amounts of beer, wine, 00:02:17.380 --> 00:02:23.230 and hard liquor that you drink, and these three categories are probably not very 00:02:23.230 --> 00:02:28.030 highly correlated because some people tend to favor beer over wine, some 00:02:28.030 --> 00:02:32.800 people tend to favor wine over beer, and not many people drink hard liquor, for example. 00:02:32.800 --> 00:02:36.580 So that's one category of variables where it makes 00:02:36.580 --> 00:02:40.660 sense to do an index. Another one is that the variables that go into the 00:02:40.660 --> 00:02:44.650 index are different kinds of manifestations of the same behavior. 00:02:44.650 --> 00:02:50.860 For example, a company may want to do supply chain redesign. That can be 00:02:50.860 --> 00:02:55.990 done in multiple different ways.
So if we measure, for example, whether the 00:02:55.990 --> 00:03:00.910 company has taken any of, let's say, 15 different actions that they could take 00:03:00.910 --> 00:03:06.040 to redesign their supply chain, then taking an index of those indicators to 00:03:06.040 --> 00:03:11.950 measure the overall degree of supply chain redesign would be a reasonable thing to do. 00:03:11.950 --> 00:03:19.120 So how do you justify indices and when would you use indices? 00:03:19.120 --> 00:03:24.070 Wooldridge in his econometrics book provides a good example of when you would use an index. 00:03:24.070 --> 00:03:30.550 In his example he is using schools and different expenditure categories. The 00:03:30.550 --> 00:03:35.650 expenditure categories, how the schools spend their money, are correlated because schools 00:03:35.650 --> 00:03:40.600 that have more money spend more and schools that are poor don't spend as much. 00:03:40.600 --> 00:03:45.820 If we have, let's say, 20 different ways that schools can spend money 00:03:45.820 --> 00:03:52.810 and then we have, let's say, a hundred schools, running a regression analysis would be a bit 00:03:52.810 --> 00:03:59.770 problematic because of multicollinearity. So the idea would be that because we have so 00:03:59.770 --> 00:04:04.300 many categories we can't really say whether any of them matters independently. 00:04:04.300 --> 00:04:08.530 What if we are interested in the more general question: does 00:04:08.530 --> 00:04:13.990 school expenditure, how much money you spend overall and not 00:04:13.990 --> 00:04:18.970 whether it's in a specific category, influence student performance?
00:04:18.970 --> 00:04:22.030 Then we could take a sum of all the spending 00:04:22.030 --> 00:04:26.440 categories and then use that as an explanatory variable in a regression 00:04:27.310 --> 00:04:31.570 analysis with student performance as the dependent variable. That would make a lot of 00:04:31.570 --> 00:04:37.780 sense unless we are specifically interested in a particular category and how much that contributes. 00:04:37.780 --> 00:04:42.610 So when our research interest is in this higher level concept, like 00:04:42.610 --> 00:04:46.660 spending instead of spending in a particular category, and when we 00:04:46.660 --> 00:04:51.790 have a small sample size, then taking an index would be a reasonable thing to do. 00:04:51.790 --> 00:04:58.270 Of course if we had a million schools in our sample, which is unrealistic, but if we had 00:04:58.270 --> 00:05:04.570 that, then modeling each of these expenditure categories as separate explanatory variables in 00:05:04.570 --> 00:05:08.350 the regression analysis would be possible and that would be the ideal thing to do. 00:05:08.350 --> 00:05:14.410 When we do indices there are a couple of statistical assumptions that we make 00:05:14.410 --> 00:05:17.680 and then we have to decide whether those assumptions are reasonable. 00:05:17.680 --> 00:05:24.940 So let's consider that we have an index C defined as a sum of x1, x2, and x3. For 00:05:24.940 --> 00:05:28.720 simplicity I'm just using a sum; I could also be using a weighted sum. 00:05:28.720 --> 00:05:32.710 The index can be used as an independent variable or as a 00:05:32.710 --> 00:05:34.720 dependent variable in the regression analysis.
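Written out, the setup just described might look as follows. This is only a sketch in assumed notation: $y$ for an outcome, $w_k$ for optional weights, and $\varepsilon$ for the error term are labels chosen here for illustration, while $C$, $x_1$, $x_2$, $x_3$, $Z$, and $\beta_1$ follow the names used in the talk.

```latex
% The index: an unweighted sum, or more generally a weighted sum
C = x_1 + x_2 + x_3
\qquad\text{or}\qquad
C = w_1 x_1 + w_2 x_2 + w_3 x_3

% Index used as an independent variable
y = \beta_0 + \beta_1 C + \varepsilon

% Index used as a dependent variable
C = \beta_0 + \beta_1 Z + \varepsilon
```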
00:05:34.720 --> 00:05:39.730 If we use the index as an independent variable, 00:05:39.730 --> 00:05:47.380 that is the same as assuming that all of these variables x1, x2, and x3 here 00:05:47.380 --> 00:05:51.670 in the regression model have the same effect beta 1 on the dependent variable. 00:05:51.670 --> 00:05:57.940 Does it make sense to assume that they are the same? The model probably is not strictly correct, 00:05:57.940 --> 00:06:05.110 but if we're interested in just understanding the overall level of spending of a school, then taking 00:06:05.110 --> 00:06:11.440 the sum of different spending categories would be okay. If we are interested in understanding 00:06:11.440 --> 00:06:18.670 the effects of a person's height and weight on, for example, how much the person exercises, then 00:06:18.670 --> 00:06:24.400 assuming that those two, height and weight, have the same effect would be unreasonable because 00:06:24.400 --> 00:06:28.630 that's so unrealistic, and we normally want to know the different effects of height and weight. 00:06:28.630 --> 00:06:35.440 So what if we have the index as a dependent variable? The scenario is very similar. In 00:06:35.440 --> 00:06:40.600 this case having an index as a dependent variable is the same as having a separate 00:06:40.600 --> 00:06:47.050 regression model for each component that goes into the index. And we assume that the 00:06:47.050 --> 00:06:54.130 independent variable Z here has the same effect beta 1 on each component of the index. 00:06:54.130 --> 00:06:59.830 For example, if we are modeling how much a change in the principal of a school 00:06:59.830 --> 00:07:05.560 influences spending, then running that kind of model would be reasonable if we're only 00:07:05.560 --> 00:07:11.110 interested in the overall level of spending and not specifically in any spending categories.
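The two cases above can be checked numerically. The following is a minimal sketch on simulated data, not anything taken from the talk: the helper `ols`, the sample size, the seed, and every coefficient value are assumptions made up for illustration. It shows that a regression on the index reproduces the separate-variable regression when the components really do share one effect, that it returns a single blended slope when they do not, and that with the index as the dependent variable the slope on Z is just the sum of the component slopes, so the index model alone cannot tell the components apart.

```python
import random


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows; each row must already start with 1.0 for the
    intercept. Returns the list of coefficients.
    """
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


rng = random.Random(0)  # fixed seed, purely for reproducibility
n = 100                 # hypothetical sample, e.g. a hundred schools
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
C = [a + b + c for a, b, c in zip(x1, x2, x3)]  # the index

X_separate = [[1.0, a, b, c] for a, b, c in zip(x1, x2, x3)]
X_index = [[1.0, c] for c in C]

# Case 1: index as independent variable, equal effects hold exactly
# (noise-free, made-up coefficients 2.0 and 0.5).
y_equal = [2.0 + 0.5 * c for c in C]
b_separate_equal = ols(X_separate, y_equal)  # three slopes, each about 0.5
b_index_equal = ols(X_index, y_equal)        # one slope, about 0.5

# Case 2: the components have different effects (1.0, 0.5, 0.0).
y_unequal = [2.0 + 1.0 * a + 0.5 * b + 0.0 * c for a, b, c in zip(x1, x2, x3)]
b_separate_unequal = ols(X_separate, y_unequal)  # recovers the three effects
b_index_unequal = ols(X_index, y_unequal)        # a single blended slope

# Case 3: index as dependent variable, with Z affecting components unequally.
Z = [rng.gauss(0, 1) for _ in range(n)]
components = [
    [0.8 * z + rng.gauss(0, 1) for z in Z],
    [0.1 * z + rng.gauss(0, 1) for z in Z],
    [rng.gauss(0, 1) for _ in Z],
]
C_dep = [sum(values) for values in zip(*components)]
X_z = [[1.0, z] for z in Z]
component_slopes = [ols(X_z, values)[1] for values in components]
index_slope = ols(X_z, C_dep)[1]  # equals the sum of the component slopes

print("separate slopes, equal effects:  ", b_separate_equal[1:])
print("index slope, equal effects:      ", b_index_equal[1])
print("separate slopes, unequal effects:", b_separate_unequal[1:])
print("index slope, unequal effects:    ", b_index_unequal[1])
print("component slopes on Z:", component_slopes, "index slope:", index_slope)
```

Under these assumptions the index model loses nothing in the first case and hides the differences in the second and third, which is the trade-off the talk describes.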
00:07:11.110 --> 00:07:17.260 Trying to understand whether exercise influences the sum 00:07:17.260 --> 00:07:21.280 of your height and weight would be unreasonable because you can't 00:07:21.280 --> 00:07:25.060 change your height by exercising but you can influence your weight. 00:07:25.060 --> 00:07:30.430 So whether it makes sense to use indices can also be thought 00:07:30.430 --> 00:07:35.860 through with this approach. Does it make sense to assume or approximate all 00:07:35.860 --> 00:07:41.950 these effects to be the same, whether the index is the cause or the effect? 00:07:41.950 --> 00:07:46.870 So, as a summary: how do you do indices and when would you want to make one? 00:07:46.870 --> 00:07:55.270 The idea of indices is that the construction of an index doesn't validate anything. So you can 00:07:55.270 --> 00:08:00.640 take sums of things that are unrelated, sums of things that are invalid or unreliable. 00:08:00.640 --> 00:08:05.650 Just the act of taking a sum does not provide you any reliability or validity evidence. 00:08:05.650 --> 00:08:12.340 Therefore if you take variables and do a sum, you have to validate and assess the 00:08:12.340 --> 00:08:17.590 reliability separately before you start forming the index. If we are doing, for 00:08:17.590 --> 00:08:23.560 example, stock indices, then we know that the individual stock values are 00:08:23.560 --> 00:08:28.300 valid and reliable by definition because the numbers are what they are. 00:08:28.300 --> 00:08:34.240 If we have survey measures, we ask people how much wine they drink, how 00:08:34.240 --> 00:08:37.300 much beer they drink, how much hard liquor they drink. Then we have to 00:08:37.300 --> 00:08:41.950 validate and assess the reliability of those survey measures before we form an index. 00:08:41.950 --> 00:08:50.710 Then we have to justify the index: does it make sense to sum these 00:08:52.000 --> 00:08:55.840 different variables?
I can think of two different justifications. 00:08:55.840 --> 00:09:00.340 One is that the variables that go into the index represent different 00:09:00.340 --> 00:09:06.730 quantities or different forms of the same thing. So wine, beer, and hard liquor 00:09:06.730 --> 00:09:13.000 all represent different forms of alcohol. The other is that they represent different ways that the 00:09:13.000 --> 00:09:18.880 behavior can manifest. For example, different ways that you can redesign your supply chain. 00:09:18.880 --> 00:09:25.030 The third thing that you have to do is to justify why you are using an index over 00:09:25.030 --> 00:09:31.510 separate items. And this has to do with the level of theorizing. So are you interested in a higher 00:09:31.510 --> 00:09:38.350 level question of, for example, does supply chain redesign matter at all, or are you interested in 00:09:38.350 --> 00:09:43.900 understanding what kind of supply chain redesign matters? Are you interested in whether 00:09:43.900 --> 00:09:49.840 alcohol drinking in general, or beer drinking versus wine drinking, causes health problems, and so on. 00:09:49.840 --> 00:09:55.420 The justification can also rely on sample size. So if you have a small sample then 00:09:55.420 --> 00:10:01.630 that sets limitations on what we can do. If the sample is very small then estimating 00:10:01.630 --> 00:10:08.170 these different effects of different index components is going to be so imprecise that 00:10:08.170 --> 00:10:13.090 it doesn't make sense. So in small samples an index is probably more useful than the 00:10:13.090 --> 00:10:17.020 separate raw variables. So that must be justified in your research report. 00:10:17.020 --> 00:10:23.470 Then the final thing is how you set the weights. The index weights should 00:10:23.470 --> 00:10:29.140 not be defined empirically because there is no good way of doing so without causing problems.
00:10:29.140 --> 00:10:36.550 So you set the weights based on theory. For example, if you have no idea of how 00:10:36.550 --> 00:10:41.980 much those different indicators should contribute to the index, you don't know 00:10:41.980 --> 00:10:46.870 whether one indicator is more important than another one, use equal weights, and 00:10:46.870 --> 00:10:53.320 that's probably the best recommendation in any case. 00:10:53.320 --> 00:11:00.160 Finally, if you do indices, which often is a reasonable thing to do, then just to get 00:11:00.160 --> 00:11:05.290 your study published it may be a good idea to avoid the terms formative, formative 00:11:05.290 --> 00:11:11.620 measurement, or causal indicator in your study, because otherwise reviewers will 00:11:11.620 --> 00:11:17.080 challenge you to defend that the indicators cause the construct, and that's the key 00:11:17.080 --> 00:11:20.740 problem. That's an unrealistic idea in the formative measurement literature. 00:11:20.740 --> 00:11:24.970 So to summarize, taking indices of different variables is okay, 00:11:24.970 --> 00:11:29.950 but saying that the indicators cause the construct, that is the problematic idea.
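The advice on weights can be sketched in a few lines. This is a hypothetical illustration only: the function `build_index`, the survey numbers, and the example theory-based weights are all made up here. The point it shows is that the weights are fixed before looking at the data, with equal weights as the default.

```python
def build_index(rows, weights=None):
    """Return one index value per row as a weighted sum of its components.

    rows: list of equal-length sequences, one per respondent or firm.
    weights: chosen in advance from theory; None means equal weights of 1.
    """
    if not rows:
        return []
    k = len(rows[0])
    if weights is None:
        weights = [1.0] * k  # equal weights when theory gives no guidance
    if len(weights) != k:
        raise ValueError("need exactly one weight per component")
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


# Hypothetical survey answers: (beer, wine, hard liquor) servings per week.
answers = [(2, 1, 0), (0, 3, 1), (5, 0, 0)]

# Default: equal weights, i.e. a plain sum.
equal_weight_index = build_index(answers)  # [3.0, 4.0, 5.0]

# Made-up theory-based weights: suppose a serving of liquor counts double.
theory_index = build_index(answers, weights=[1.0, 1.0, 2.0])  # [3.0, 5.0, 5.0]

print(equal_weight_index)
print(theory_index)
```

Nothing in this sketch estimates weights from the data, which matches the recommendation above.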