WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:02.700 The biggest problem with formative measurement is 00:00:02.700 --> 00:00:05.190 the idea that the indicators cause the construct. 00:00:05.190 --> 00:00:10.710 There are also statistical issues in how these models are specified and, 00:00:10.710 --> 00:00:14.940 in particular, how the models are identified. I will explain a couple of these issues in 00:00:14.940 --> 00:00:19.290 this video. There are a couple more but they are not as important as these two issues. 00:00:19.290 --> 00:00:27.540 The root of the problem is that a formative model - where we specify this latent variable as a 00:00:27.540 --> 00:00:33.390 function of these observed variables, three in this example, and this unobserved error term - is 00:00:33.390 --> 00:00:39.030 not identified in itself. It's like a regression analysis without the dependent variable basically. 00:00:39.030 --> 00:00:45.270 It's not identified because the correlations among these three indicators are free and that 00:00:45.270 --> 00:00:50.820 consumes all degrees of freedom and we don't have any more information for estimating 00:00:50.820 --> 00:00:55.230 these paths or the variance of the error term. So the degrees of freedom are negative. 00:00:55.230 --> 00:01:02.910 There are a couple of ways around this problem. The most commonly recommended way is that we add 00:01:02.910 --> 00:01:09.180 two normal indicators. The literature on formative measurement calls these reflective indicators. 00:01:09.180 --> 00:01:16.470 So we specify that this latent variable here actually is a common factor for these 00:01:16.470 --> 00:01:22.050 two measurements and these measurements are added there for identification of the model. 
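The degrees-of-freedom argument above can be checked by counting. The following is a minimal sketch, not the speaker's own material: the function names are hypothetical, and it assumes the latent variable is scaled by fixing one reflective loading to 1. A non-negative count is only a necessary condition for identification, not a sufficient one.

```python
def n_moments(p: int) -> int:
    """Unique variances and covariances among p observed variables."""
    return p * (p + 1) // 2


def df_formative_only(k: int) -> int:
    """Degrees of freedom when k formative indicators and an error term
    are the whole model (the latent variable has no outcomes)."""
    observed = n_moments(k)
    free = n_moments(k)  # indicator variances and covariances, all free
    free += k            # one weight (path) per formative indicator
    free += 1            # variance of the error term
    return observed - free


def df_with_reflective(k: int, m: int) -> int:
    """Degrees of freedom after adding m reflective indicators.
    Assumption: one loading is fixed to 1 to set the scale."""
    observed = n_moments(k + m)
    free = n_moments(k) + k + 1  # same formative part as above
    free += m - 1                # free loadings
    free += m                    # error variances of the reflective indicators
    return observed - free


print(df_formative_only(3))      # three formative indicators alone
print(df_with_reflective(3, 2))  # plus two reflective indicators
```

Under these counting assumptions the first model comes out at minus four degrees of freedom and the second at plus two, which is why adding the two reflective indicators makes estimation possible.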
00:01:22.050 --> 00:01:29.310 So this leads to an interesting problem and the problem is actually that this 00:01:29.310 --> 00:01:37.320 latent variable here is now defined by these two normal indicators instead of these three 00:01:37.320 --> 00:01:43.890 formative or causal indicators. So these indicators - measure 00:01:43.890 --> 00:01:49.320 one and measure two - actually give this latent variable its identity and meaning. 00:01:49.320 --> 00:01:55.440 So I've written a couple of papers about this topic but the problem essentially is that 00:01:55.440 --> 00:02:03.210 if these causal or formative indicators are not valid measures of this latent 00:02:03.210 --> 00:02:09.990 variable - but these two indicators are - then these weights or regression 00:02:09.990 --> 00:02:15.570 coefficients here will simply be estimated as 0. So we have a normal latent variable 00:02:15.570 --> 00:02:20.730 measured with two indicators and then we have three unrelated indicators that don't 00:02:20.730 --> 00:02:26.250 really have any relationship with the latent variable defined by these two variables here. 00:02:26.250 --> 00:02:32.340 So that's one problem. Another way of thinking about this is that if we have 00:02:32.340 --> 00:02:39.030 these two indicators here that measure the latent variable then these three indicators 00:02:39.030 --> 00:02:45.660 here at the bottom are redundant: you don't need them. So you can just define the model and measure 00:02:45.660 --> 00:02:49.410 it normally with these two indicators and there are no problems with that. 00:02:49.410 --> 00:02:57.690 And that of course doesn't go well with the idea that some concepts must be measured with these 00:02:57.690 --> 00:03:05.460 formative indicators. 
So that's one problem. The cause of this phenomenon - that the 00:03:05.460 --> 00:03:12.180 meaning of this latent variable comes from these two measures instead of these three measures - is 00:03:12.180 --> 00:03:19.620 that we have this error term here. The error term guarantees that whatever these 00:03:19.620 --> 00:03:26.940 indicators represent, the error term - because it's uncorrelated with these three 00:03:26.940 --> 00:03:32.580 indicators here - makes the latent variable a common factor of these two indicators. 00:03:32.580 --> 00:03:38.730 So if these three indicators are conceptually unrelated to whatever 00:03:38.730 --> 00:03:44.460 these two indicators represent then the error term here will compensate for that 00:03:44.460 --> 00:03:49.860 and we are basically just modeling the error term with these three indicators 00:03:50.460 --> 00:03:54.780 instead of whatever we think that these causal indicators here cause. 00:03:54.780 --> 00:04:02.280 So that's one problem. How do we deal with that problem? We can of course eliminate 00:04:02.280 --> 00:04:06.660 that problem by eliminating the error term from the model. But that leads 00:04:06.660 --> 00:04:11.280 to another problem. So let's consider this kind of model. So here this is not a latent 00:04:11.280 --> 00:04:17.730 variable anymore because this formative latent variable is actually just a weighted sum of these 00:04:17.730 --> 00:04:21.870 indicators. There's no error and this is like a regression analysis without an error term. 00:04:21.870 --> 00:04:27.150 Then how do we set these different weights? We create an index based 00:04:27.150 --> 00:04:31.980 on three different indicators and we have to set these weights. 
The normal way 00:04:31.980 --> 00:04:38.100 of defining or specifying this kind of model is that we have this latent 00:04:38.100 --> 00:04:43.170 variable here without the error term and then we have another latent variable that we want 00:04:43.170 --> 00:04:47.700 to explain with this latent variable and we have a regression relationship. 00:04:47.700 --> 00:04:55.110 Specifying a model like that defines these weights so that they maximize this path. 00:04:55.110 --> 00:05:03.000 Is that problematic or not? Well it is problematic because if we want to test 00:05:03.000 --> 00:05:10.110 for example whether this beta here is zero or not, that is, whether this 00:05:10.110 --> 00:05:16.350 formative LV has an effect on this other latent variable, then setting these weights so that the 00:05:16.350 --> 00:05:21.570 beta is as large as possible is probably the worst possible way that you can create an index. 00:05:21.570 --> 00:05:28.080 So if you want to test if something exists then trying to use any correlations in 00:05:28.080 --> 00:05:33.060 your data to make your estimate as large as possible is not a good estimation principle. 00:05:33.060 --> 00:05:41.010 So there's possible positive bias. Another problem is that if we 00:05:41.010 --> 00:05:46.740 set these weights so that this beta is as large as possible then the weights 00:05:46.740 --> 00:05:54.810 actually depend on whatever this other latent variable is and this leads to a 00:05:54.810 --> 00:05:58.230 problem called interpretational confounding in this literature. 00:05:58.230 --> 00:06:03.180 So the meaning of this latent variable here - that is supposed to be caused 00:06:03.180 --> 00:06:08.370 by these three formative indicators - actually depends on what the other 00:06:08.370 --> 00:06:13.110 latent variable or other variables we have in the model are. And that's undesirable. 
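Both problems described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model from the video: the composite is observed rather than latent, the weights are estimated by ordinary least squares (which maximizes the in-sample correlation between the weighted sum and the outcome), the data are simulated, and all names are hypothetical.

```python
import random
import statistics


def center(v):
    m = statistics.fmean(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def corr(a, b):
    a, b = center(a), center(b)
    return dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5


def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols_weights(xs, y):
    """Weights that maximize the correlation of the weighted sum with y."""
    xc = [center(x) for x in xs]
    yc = center(y)
    A = [[dot(a, b) for b in xc] for a in xc]
    return solve(A, [dot(a, yc) for a in xc])


def composite(xs, w):
    return [sum(wi * x[i] for wi, x in zip(w, xs)) for i in range(len(xs[0]))]


rng = random.Random(1)  # fixed seed so the run is reproducible
n, reps = 100, 200

# Part 1: the outcome is unrelated to the indicators (true effect is zero).
r_opt, r_fix = [], []
for _ in range(reps):
    xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    r_opt.append(corr(composite(xs, ols_weights(xs, y)), y))
    r_fix.append(abs(corr(composite(xs, [1, 1, 1]), y)))
mean_opt, mean_fix = statistics.fmean(r_opt), statistics.fmean(r_fix)
print("mean |r|, estimated weights:  ", round(mean_opt, 3))
print("mean |r|, fixed equal weights:", round(mean_fix, 3))

# Part 2: same indicators, two different outcomes, two different sets of weights.
xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
y_a = [x + rng.gauss(0, 1) for x in xs[0]]  # outcome A depends on indicator 1
y_b = [x + rng.gauss(0, 1) for x in xs[2]]  # outcome B depends on indicator 3
w_a, w_b = ols_weights(xs, y_a), ols_weights(xs, y_b)
print("weights when explaining outcome A:", [round(w, 2) for w in w_a])
print("weights when explaining outcome B:", [round(w, 2) for w in w_b])
```

In part 1 the estimated weights give a larger average correlation than the fixed weights even though the true effect is zero, which is the positive bias. In part 2 the "same" composite gets different weights depending on the outcome, which is what interpretational confounding refers to.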
00:06:13.110 --> 00:06:20.520 So think about a stock index. Would it make sense that the stock index 00:06:20.520 --> 00:06:25.260 would be different depending on who is using the index? I don't think so. It 00:06:25.260 --> 00:06:30.780 should be the same. So the meaning of the index should be the same across studies which 00:06:30.780 --> 00:06:35.670 means that the weights of these indicators also must stay the same. 00:06:35.670 --> 00:06:42.580 Then there's also the assumption that if these indicators here have any effect on this other 00:06:42.580 --> 00:06:48.970 latent variable - then that effect must be fully mediated by this formative latent variable. 00:06:48.970 --> 00:06:54.520 So let's consider socioeconomic status. So that's our formative 00:06:54.520 --> 00:07:02.440 latent variable. One of the indicators is your education and then we 00:07:02.440 --> 00:07:08.560 want to explain a child's education with the parents' socioeconomic status. 00:07:08.560 --> 00:07:15.250 Is it reasonable to assume that the parents' education has no other causal 00:07:15.250 --> 00:07:21.070 effect on the child's education than the one fully mediated by socioeconomic status? 00:07:21.070 --> 00:07:27.460 That is clearly unreasonable. So that full mediation assumption here is also unreasonable. 00:07:27.460 --> 00:07:37.030 So what's the alternative? The solution is to define these weights based on theory. So you 00:07:37.030 --> 00:07:43.180 set the weights based on your understanding of the phenomenon instead of trying to estimate 00:07:43.180 --> 00:07:50.110 the weights empirically and that leads to index construction. 
So instead of doing 00:07:50.110 --> 00:07:56.680 this complicated latent variable model that possibly has an error term - we just take the 00:07:56.680 --> 00:08:03.310 indicators and we take a mean or we take a sum or we take a weighted sum and we do 00:08:03.310 --> 00:08:09.100 that before our estimation and we define the weights for the index construction 00:08:09.100 --> 00:08:12.370 based on existing understanding of the phenomenon and on theory. 00:08:12.370 --> 00:08:17.470 I have another video on how you can actually do that and how 00:08:17.470 --> 00:08:23.320 you justify index construction. So that's clearly a good approach, a much 00:08:23.320 --> 00:08:28.180 better approach than trying to specify these formative latent variable models.
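As a rough illustration of index construction, the sketch below builds a weighted-sum index before any model is estimated. Everything specific in it is an assumption made for the example: the indicator names, the data values, the equal weights, and the choice to standardize the indicators first. In practice the weights would come from theory about the phenomenon.

```python
import statistics


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (an assumption of this
    sketch, so that indicators on different scales are comparable)."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


def build_index(indicators, weights):
    """Weighted sum of standardized indicators with weights fixed in advance."""
    z = {name: standardize(vals) for name, vals in indicators.items()}
    n = len(next(iter(z.values())))
    return [sum(weights[name] * z[name][i] for name in weights) for i in range(n)]


# Hypothetical data for five people.
indicators = {
    "education_years": [9, 12, 12, 16, 18],
    "income_thousands": [20, 35, 30, 60, 80],
    "occupation_prestige": [30, 40, 45, 60, 70],
}

# Hypothetical weights, chosen before looking at any outcome variable.
weights = {
    "education_years": 1 / 3,
    "income_thousands": 1 / 3,
    "occupation_prestige": 1 / 3,
}

ses_index = build_index(indicators, weights)
print([round(v, 2) for v in ses_index])
```

Because the weights never see an outcome variable, the index means the same thing in every model that uses it, and it can be entered into a regression like any other observed variable.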