WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:02.700 The biggest problem in the  formative measurement idea is   00:00:02.700 --> 00:00:05.190 the idea that the indicators cause the construct. 00:00:05.190 --> 00:00:10.710 There are also statistical issues in  how these models are specified and   00:00:10.710 --> 00:00:14.940 particularly in how the models are identified. I  will explain a couple of these issues in   00:00:14.940 --> 00:00:19.290 this video. There are a couple more but they  are not as important as these two issues. 00:00:19.290 --> 00:00:27.540 The root of the problem is that a formative model  - where we specify this latent variable as a   00:00:27.540 --> 00:00:33.390 function of these observed variables, three in  this example, and this unobserved error term - is   00:00:33.390 --> 00:00:39.030 not identified by itself. It's basically like a regression  analysis without the dependent variable. 00:00:39.030 --> 00:00:45.270 It's not identified because these correlations  among these three indicators are free and that   00:00:45.270 --> 00:00:50.820 consumes all degrees of freedom and we don't  have any more information for estimating   00:00:50.820 --> 00:00:55.230 these paths or the variance of the error  term. So the degrees of freedom are negative. 00:00:55.230 --> 00:01:02.910 There are a couple of ways around this problem.  The most commonly recommended way is that we add   00:01:02.910 --> 00:01:09.180 two normal indicators. The literature on formative  measurement calls these reflective indicators. 00:01:09.180 --> 00:01:16.470 So we specify that this latent variable  here actually is a common factor for these   00:01:16.470 --> 00:01:22.050 two measurements and these measurements are  added there for identification of the model. 
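The identification argument above can be illustrated with a simple parameter count. The sketch below is not from the video: the function names are hypothetical, and it assumes the usual conventions that the covariances among the causal indicators are free and that one reflective loading is fixed to 1 to set the scale. The counting rule is only a necessary condition for identification, so a non-negative result does not by itself prove that a model is identified.

```python
def distinct_moments(n_observed: int) -> int:
    """Unique variances and covariances among the observed variables."""
    return n_observed * (n_observed + 1) // 2


def formative_only_df(n_causes: int) -> int:
    """Degrees of freedom when the latent variable has only causal
    indicators and an error term (no reflective indicators)."""
    moments = distinct_moments(n_causes)
    exogenous = distinct_moments(n_causes)  # free variances and covariances of the indicators
    weights = n_causes                      # one path per causal indicator
    error_variance = 1                      # variance of the latent variable's error term
    return moments - (exogenous + weights + error_variance)


def with_reflective_df(n_causes: int, n_reflective: int) -> int:
    """Degrees of freedom after adding reflective indicators.
    Assumption: one loading is fixed to 1 to scale the latent variable."""
    moments = distinct_moments(n_causes + n_reflective)
    exogenous = distinct_moments(n_causes)
    weights = n_causes
    error_variance = 1
    free_loadings = n_reflective - 1
    measurement_errors = n_reflective
    return moments - (
        exogenous + weights + error_variance + free_loadings + measurement_errors
    )


print(formative_only_df(3))      # -4: more parameters than information
print(with_reflective_df(3, 2))  # 2: the counting rule is now satisfied
```

With three causal indicators alone, the six observed variances and covariances are used up by the indicators themselves, which is why nothing is left for the weights or the error variance.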
00:01:22.050 --> 00:01:29.310 So this leads to an interesting problem  and the problem is actually that this   00:01:29.310 --> 00:01:37.320 latent variable here is now defined by these  two normal indicators instead of these three   00:01:37.320 --> 00:01:43.890 formative or causal indicators. So these  indicators - measure   00:01:43.890 --> 00:01:49.320 one and measure two - actually give this  latent variable its identity and meaning. 00:01:49.320 --> 00:01:55.440 So I've written a couple of papers about this  topic but the problem essentially is that   00:01:55.440 --> 00:02:03.210 if these causal or formative indicators  are not valid measures of this latent   00:02:03.210 --> 00:02:09.990 variable - but these two normal indicators  are - then these weights or regression   00:02:09.990 --> 00:02:15.570 coefficients here will simply be estimated  as 0. So we have a normal latent variable   00:02:15.570 --> 00:02:20.730 measured with two indicators and then we  have three unrelated indicators that don't   00:02:20.730 --> 00:02:26.250 really have any relationship with the latent  variable defined by these two indicators here. 00:02:26.250 --> 00:02:32.340 So that's one problem. Another way of  thinking about this is that if we have   00:02:32.340 --> 00:02:39.030 these two indicators here that measure the  latent variable then these three indicators   00:02:39.030 --> 00:02:45.660 here at the bottom are not needed.  So you can just define the model and measure   00:02:45.660 --> 00:02:49.410 the latent variable normally with these two indicators  and there are no problems with that. 00:02:49.410 --> 00:02:57.690 And that of course doesn't go well with the idea  that some concepts must be measured with these   00:02:57.690 --> 00:03:05.460 formative indicators. 
So that's one problem. The  cause of this phenomenon - that the   00:03:05.460 --> 00:03:12.180 meaning of this latent variable comes from these  two measures instead of these three measures - is   00:03:12.180 --> 00:03:19.620 that we have this error term here. Whatever these   00:03:19.620 --> 00:03:26.940 three indicators represent, the error term  - because it's unrelated to these three   00:03:26.940 --> 00:03:32.580 indicators here - makes the latent variable  a common factor of these two indicators. 00:03:32.580 --> 00:03:38.730 So if these three indicators are  conceptually unrelated to whatever   00:03:38.730 --> 00:03:44.460 these two indicators represent then the  error term here will compensate for that   00:03:44.460 --> 00:03:49.860 and we are basically just modeling the  error term with these three indicators   00:03:50.460 --> 00:03:54.780 instead of whatever we think  these causal indicators here cause. 00:03:54.780 --> 00:04:02.280 So that's one problem. How do we deal with  that problem? We can of course eliminate   00:04:02.280 --> 00:04:06.660 that problem by eliminating the error term  from the model. But that leads   00:04:06.660 --> 00:04:11.280 to another problem. So let's consider this  kind of model. So here this is not a latent   00:04:11.280 --> 00:04:17.730 variable anymore because this formative latent  variable is actually just a weighted sum of these   00:04:17.730 --> 00:04:21.870 indicators. There's no error and this is like  a regression analysis without an error term. 00:04:21.870 --> 00:04:27.150 Then how do we set these different  weights? So we create an index based   00:04:27.150 --> 00:04:31.980 on three different indicators. We  set these weights. 
The normal way   00:04:31.980 --> 00:04:38.100 of specifying this  kind of model is that we have this latent   00:04:38.100 --> 00:04:43.170 variable here without the error term and then  we have another latent variable that we want   00:04:43.170 --> 00:04:47.700 to explain with this latent variable  and we have a regression relationship. 00:04:47.700 --> 00:04:55.110 Specifying a model like that defines these  weights so that they maximize this path.   00:04:55.110 --> 00:05:03.000 And is that problematic or not? Well it  is problematic because if we want to test   00:05:03.000 --> 00:05:10.110 for example whether this beta here is zero or  not - whether this   00:05:10.110 --> 00:05:16.350 formative LV has an effect on this other latent  variable - then setting these weights so that the   00:05:16.350 --> 00:05:21.570 beta is as large as possible is probably the  worst possible way that you can create an index. 00:05:21.570 --> 00:05:28.080 So if you want to test if something exists  then trying to use any correlations in   00:05:28.080 --> 00:05:33.060 your data to make your estimate as large as  possible is not a good estimation principle. 00:05:33.060 --> 00:05:41.010 So there's a possible positive bias.  Another problem is that if we   00:05:41.010 --> 00:05:46.740 set these weights so that this beta is  as large as possible then the weights   00:05:46.740 --> 00:05:54.810 actually depend on whatever this other  latent variable is and this leads to a   00:05:54.810 --> 00:05:58.230 problem called interpretational  confounding in this literature. 00:05:58.230 --> 00:06:03.180 So the meaning of this latent variable  here - that is supposed to be caused   00:06:03.180 --> 00:06:08.370 by these three formative indicators -  actually depends on what the other   00:06:08.370 --> 00:06:13.110 latent variable or other variables  in the model are. And that's undesirable. 
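Both problems described above, the positive bias and the interpretational confounding, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the model from the video: the data are simulated, all variable and function names are hypothetical, the outcomes are treated as observed variables instead of latent ones, and ordinary least squares stands in for the estimator, because the weighted sum that correlates most strongly with an outcome is its least-squares prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Three hypothetical, moderately correlated indicators.
cov = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
x = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=n)

# Two different outcomes that depend on different indicators,
# plus one outcome that is unrelated to all of them.
y_a = 0.6 * x[:, 0] + 0.1 * x[:, 1] + rng.normal(size=n)
y_b = 0.1 * x[:, 1] + 0.6 * x[:, 2] + rng.normal(size=n)
y_null = rng.normal(size=n)


def max_correlation_weights(indicators, outcome):
    """Weights of the composite that correlates most strongly with the
    outcome (the OLS solution), scaled so absolute values sum to 1."""
    xc = indicators - indicators.mean(axis=0)
    yc = outcome - outcome.mean()
    coef, *_ = np.linalg.lstsq(xc, yc, rcond=None)
    composite = xc @ coef
    r = np.corrcoef(composite, yc)[0, 1]
    return coef / np.abs(coef).sum(), r


w_a, r_a = max_correlation_weights(x, y_a)
w_b, r_b = max_correlation_weights(x, y_b)
w_null, r_null = max_correlation_weights(x, y_null)

print("weights when explaining outcome A:", np.round(w_a, 2))
print("weights when explaining outcome B:", np.round(w_b, 2))
print("correlation with an unrelated outcome:", round(r_null, 3))
```

The same three indicators receive different weights depending on which outcome is in the model, and the composite correlates positively even with an outcome that is unrelated to every indicator.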
00:06:13.110 --> 00:06:20.520 So think about a stock index.  Would it make sense that the stock index   00:06:20.520 --> 00:06:25.260 would be different depending on who is  using the index? I don't think so. It   00:06:25.260 --> 00:06:30.780 should be the same. So the meaning of the  index should be the same across studies which   00:06:30.780 --> 00:06:35.670 means that these indicators - these  weights - also must stay the same. 00:06:35.670 --> 00:06:42.580 Then there's also the assumption that if these  indicators here have any effect on this other   00:06:42.580 --> 00:06:48.970 latent variable - then those effects must be fully  mediated by this formative latent variable. 00:06:48.970 --> 00:06:54.520 So let's consider socioeconomic  status. So that's our formative   00:06:54.520 --> 00:07:02.440 latent variable. One of the indicators  is your education and then we   00:07:02.440 --> 00:07:08.560 want to explain a child's education  with the parents' socioeconomic status. 00:07:08.560 --> 00:07:15.250 Is it reasonable to assume that the  parents' education has no causal   00:07:15.250 --> 00:07:21.070 effect on the child's education other than through  full mediation by socioeconomic status?   00:07:21.070 --> 00:07:27.460 That is clearly unreasonable. So that full  mediation assumption here is also unreasonable. 00:07:27.460 --> 00:07:37.030 So what's the alternative? The solution is to  define these weights based on theory. So you   00:07:37.030 --> 00:07:43.180 set the weights based on your understanding of  the phenomenon instead of trying to estimate   00:07:43.180 --> 00:07:50.110 the weights empirically and that leads to  index construction. 
So instead of doing   00:07:50.110 --> 00:07:56.680 this complicated latent variable model that  possibly has an error term - we just take the   00:07:56.680 --> 00:08:03.310 indicators and we take a mean or we take  a sum or we take a weighted sum and we do   00:08:03.310 --> 00:08:09.100 that before our estimation and we define  the weights for the index construction   00:08:09.100 --> 00:08:12.370 based on existing understanding of  the phenomenon and the theory. 00:08:12.370 --> 00:08:17.470 I have another video on how  you can actually do that and how   00:08:17.470 --> 00:08:23.320 you justify index construction. So  that's clearly a good approach. A much   00:08:23.320 --> 00:08:28.180 better approach than trying to specify  these formative latent variable models.
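The index construction described here can be sketched in a few lines. This is only an illustration under assumptions that are not from the video: the data are simulated, the names are hypothetical, and equal weights stand in for whatever weights the theory would actually justify. The point it shows is the order of operations: the weights are fixed first, the index is computed from them, and only then is the effect estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Simulated stand-ins for three indicators and an outcome (hypothetical data).
indicators = rng.normal(size=(n, 3))
outcome = 0.4 * indicators.mean(axis=1) + rng.normal(size=n)

# Weights chosen before looking at the outcome. Equal weights are an
# assumption here; in practice they would come from theory.
theory_weights = np.array([1 / 3, 1 / 3, 1 / 3])

# Build the index first, as a plain weighted sum of the indicators.
index = indicators @ theory_weights

# Then estimate the effect of the index with ordinary least squares.
design = np.column_stack([np.ones(n), index])
coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
intercept, beta = coef

print("estimated effect of the index:", round(float(beta), 3))
```

Because the weights never see the outcome, the index means the same thing regardless of which other variables are in the model.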