WEBVTT
Kind: captions
Language: en

00:00:00.030 --> 00:00:02.700
The biggest problem in the 
formative measurement idea is  

00:00:02.700 --> 00:00:05.190
the idea that the indicators cause the construct.

00:00:05.190 --> 00:00:10.710
There are also statistical issues in
how these models are specified and,

00:00:10.710 --> 00:00:14.940
in particular, how the models are identified. I
will explain a couple of these issues in  

00:00:14.940 --> 00:00:19.290
this video. There are a couple more but they 
are not as important as these two issues.

00:00:19.290 --> 00:00:27.540
The root of the problem is that a formative model 
- where we specify this latent variable as a  

00:00:27.540 --> 00:00:33.390
function of these observed variables, three in
this example, and this unobserved error term - is

00:00:33.390 --> 00:00:39.030
not identified in itself. It's like a regression 
analysis without the dependent variable basically.

00:00:39.030 --> 00:00:45.270
It's not identified because these correlations 
within these three indicators are free and that  

00:00:45.270 --> 00:00:50.820
consumes all the degrees of freedom, so we don't
have any more information for estimating  

00:00:50.820 --> 00:00:55.230
these paths or the variance of the error
term. So the degrees of freedom are negative.

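NOTE
A minimal sketch of the counting argument in the cue above. It assumes the
usual covariance-structure bookkeeping: information is the number of unique
variances and covariances, and every free quantity is one parameter. The
function name formative_df is hypothetical, and no scaling constraint is
counted.
```python
def formative_df(n_indicators):
    """Degrees of freedom of a stand-alone formative model (sketch)."""
    # Unique variances and covariances among the observed indicators.
    moments = n_indicators * (n_indicators + 1) // 2
    # Indicator variances and covariances are all free: they use every moment.
    exogenous = moments
    # One weight per indicator plus the variance of the error term.
    structural = n_indicators + 1
    return moments - (exogenous + structural)
#
print(formative_df(3))  # prints -4
```
With three indicators the six moments are used up by the indicators' own
variances and covariances, so nothing is left for the three weights and
the error variance.
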
00:00:55.230 --> 00:01:02.910
There are a couple of ways around this problem. 
The most commonly recommended way is that we add  

00:01:02.910 --> 00:01:09.180
two normal indicators. The literature on formative
measurement calls these reflective indicators.

00:01:09.180 --> 00:01:16.470
So we specify that this latent variable 
here actually is a common factor for these  

00:01:16.470 --> 00:01:22.050
two measurements and these measurements are 
added there for identification of the model.

00:01:22.050 --> 00:01:29.310
So this leads to an interesting problem 
and the problem is actually that this  

00:01:29.310 --> 00:01:37.320
latent variable here is now defined by these 
two normal indicators instead of these three  

00:01:37.320 --> 00:01:43.890
formative or causal indicators. So these
two indicators, measure

00:01:43.890 --> 00:01:49.320
one and measure two, actually give this
latent variable its identity and meaning.

00:01:49.320 --> 00:01:55.440
So I've written a couple of papers about this
topic but the problem essentially is that

00:01:55.440 --> 00:02:03.210
if these causal or formative indicators
are not valid measures of this latent  

00:02:03.210 --> 00:02:09.990
variable - but these indicators are 
- then these weights or regression  

00:02:09.990 --> 00:02:15.570
coefficients here will simply be estimated 
as 0. So we have a normal latent variable  

00:02:15.570 --> 00:02:20.730
measured with two indicators and then we 
have three unrelated indicators that don't  

00:02:20.730 --> 00:02:26.250
really have any relationship with the latent
variable defined by these two variables here.

00:02:26.250 --> 00:02:32.340
So that's one problem. Another way of 
thinking about this is that if we have  

00:02:32.340 --> 00:02:39.030
these two indicators here that measure the 
latent variable then these three indicators  

00:02:39.030 --> 00:02:45.660
here at the bottom are redundant. You don't need them.
So you can just define the model and measure  

00:02:45.660 --> 00:02:49.410
it normally with these two indicators 
and there are no problems with that.

00:02:49.410 --> 00:02:57.690
And that of course doesn't go well with the idea 
that some concepts must be measured with these  

00:02:57.690 --> 00:03:05.460
formative indicators. So that's one problem.
The cause of this phenomenon - that the

00:03:05.460 --> 00:03:12.180
meaning of this latent variable comes from these 
two measures instead of these three measures - is  

00:03:12.180 --> 00:03:19.620
that we have this error term here and the 
error term guarantees that whatever these  

00:03:19.620 --> 00:03:26.940
indicators represent, this error term
- because it's unrelated to these three

00:03:26.940 --> 00:03:32.580
indicators here - makes the latent variable
a common factor of these two indicators.

00:03:32.580 --> 00:03:38.730
So if these three indicators are 
conceptually unrelated to whatever  

00:03:38.730 --> 00:03:44.460
these two indicators represent then the 
error term here will compensate for that  

00:03:44.460 --> 00:03:49.860
and we are basically just modeling the 
error term with these three indicators  

00:03:50.460 --> 00:03:54.780
instead of whatever we think that 
these causal indicators here cause.

00:03:54.780 --> 00:04:02.280
So that's one problem. How do we deal with
that problem? We can of course eliminate

00:04:02.280 --> 00:04:06.660
that problem by eliminating the error term
from the model. But that leads

00:04:06.660 --> 00:04:11.280
to another problem. So let's consider this 
kind of model. So here this is not a latent  

00:04:11.280 --> 00:04:17.730
variable anymore because this formative latent 
variable is actually just a weighted sum of these

00:04:17.730 --> 00:04:21.870
indicators. There's no error and this is like 
a regression analysis without an error term.

00:04:21.870 --> 00:04:27.150
Then how do we set these different 
weights? So we create an index based  

00:04:27.150 --> 00:04:31.980
on three different indicators. We 
set these weights. The normal way  

00:04:31.980 --> 00:04:38.100
of defining or specifying this
kind of model is that we have this latent  

00:04:38.100 --> 00:04:43.170
variable here without the error term and then
we have another latent variable that we want  

00:04:43.170 --> 00:04:47.700
to explain with this latent variable 
and we have a regression relationship.

00:04:47.700 --> 00:04:55.110
Specifying a model like that defines these 
weights so that they maximize this path.  

00:04:55.110 --> 00:05:03.000
And is that problematic or not? Well, it
is problematic because if we want to test  

00:05:03.000 --> 00:05:10.110
for example whether this beta here is zero or
not, that is, whether this

00:05:10.110 --> 00:05:16.350
formative latent variable has an effect on this other latent
variable, then setting these weights so that the

00:05:16.350 --> 00:05:21.570
beta is as large as possible is probably the
worst possible way that you can create an index.

00:05:21.570 --> 00:05:28.080
So if you want to test if something exists,
then trying to use any correlations in

00:05:28.080 --> 00:05:33.060
your data to make your estimate as large as
possible is not a good estimation principle.

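NOTE
A small simulation sketch of the point above, under stated assumptions:
three indicators and an outcome are drawn independently, so the true
effect is zero. Least-squares weights stand in for "weights that make the
path as large as possible"; this is an illustration, not the estimator of
any particular SEM program. All names (solve3, cov, corr, compare_weights)
are hypothetical.
```python
import random
import statistics
#
def solve3(a, b):
    """Solve a 3x3 linear system a w = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(3):
            if r != i:
                factor = m[r][i] / m[i][i]
                m[r] = [x - factor * y for x, y in zip(m[r], m[i])]
    return [m[i][3] / m[i][i] for i in range(3)]
#
def cov(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)
#
def corr(u, v):
    """Sample correlation built from the covariance helper."""
    return cov(u, v) / (cov(u, u) * cov(v, v)) ** 0.5
#
def compare_weights(n=50, reps=200, seed=1):
    """Mean absolute correlation with an unrelated outcome: fitted vs. equal weights."""
    rng = random.Random(seed)
    fitted, equal = [], []
    for _ in range(reps):
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # true effect is zero
        # Least-squares weights maximize the in-sample correlation with y.
        w = solve3([[cov(a, b) for b in xs] for a in xs], [cov(a, y) for a in xs])
        index_fit = [sum(wi * x[i] for wi, x in zip(w, xs)) for i in range(n)]
        index_eq = [sum(x[i] for x in xs) for i in range(n)]
        fitted.append(abs(corr(index_fit, y)))
        equal.append(abs(corr(index_eq, y)))
    return statistics.fmean(fitted), statistics.fmean(equal)
#
fitted_mean, equal_mean = compare_weights()
print(fitted_mean, equal_mean)
```
Even though nothing is related to the outcome, the fitted weights give a
larger average absolute correlation than fixed equal weights, which is the
positive bias described here.
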
00:05:33.060 --> 00:05:41.010
So there's a possible positive bias. Another
problem is that if we

00:05:41.010 --> 00:05:46.740
set these weights so that this beta is 
as large as possible then the weights  

00:05:46.740 --> 00:05:54.810
actually depend on whatever this other 
latent variable is and this leads to a  

00:05:54.810 --> 00:05:58.230
problem called interpretational 
confounding in this literature.

00:05:58.230 --> 00:06:03.180
So the meaning of this latent variable 
here - that is supposed to be caused  

00:06:03.180 --> 00:06:08.370
by these three formative indicators -
actually depends on what the other

00:06:08.370 --> 00:06:13.110
latent variable or the other variables we
have in the model are. And that's undesirable.

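NOTE
A sketch of interpretational confounding with made-up numbers. It assumes
two indicators and least-squares weights as a stand-in for weights fitted
against an outcome. The data and the names cov and fitted_weights are
hypothetical.
```python
def cov(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)
#
def fitted_weights(x1, x2, y):
    """Least-squares weights for two indicators, fitted against outcome y."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)
#
# Hypothetical data: the same two indicators, two different outcomes.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
outcome_a = [2 * a + b for a, b in zip(x1, x2)]  # driven mostly by x1
outcome_b = [a + 3 * b for a, b in zip(x1, x2)]  # driven mostly by x2
weights_a = fitted_weights(x1, x2, outcome_a)
weights_b = fitted_weights(x1, x2, outcome_b)
print(weights_a, weights_b)
```
The indicators never change, yet the fitted weights come out near (2, 1)
for one outcome and near (1, 3) for the other, so the "same" index means
something different in each model.
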
00:06:13.110 --> 00:06:20.520
So if you think about the stock index. 
Would it make sense that the stock index  

00:06:20.520 --> 00:06:25.260
would be different depending on who is 
using the index? I don't think so. It  

00:06:25.260 --> 00:06:30.780
should be the same. So the meaning of the 
index should be the same across studies, which

00:06:30.780 --> 00:06:35.670
means that these indicators - these 
weights - also must stay the same.

00:06:35.670 --> 00:06:42.580
Then there's also the assumption that if these 
indicators here have any effect on this other  

00:06:42.580 --> 00:06:48.970
latent variable, then that effect must be fully
mediated by this formative latent variable.

00:06:48.970 --> 00:06:54.520
So let's consider socioeconomic 
status. So that's our formative  

00:06:54.520 --> 00:07:02.440
latent variable. One of the indicators
is your education, and then we

00:07:02.440 --> 00:07:08.560
want to explain the child's education
with the parents' socioeconomic status.

00:07:08.560 --> 00:07:15.250
Is it reasonable to assume that the
parents' education has no causal

00:07:15.250 --> 00:07:21.070
effect on the child's education other than the one
fully mediated through socioeconomic status?

00:07:21.070 --> 00:07:27.460
That is clearly unreasonable. So that full 
mediation assumption here is also unreasonable.

00:07:27.460 --> 00:07:37.030
So what's the alternative? The solution is to 
define these weights based on theories. So you  

00:07:37.030 --> 00:07:43.180
set the weights based on your understanding of 
the phenomenon instead of trying to estimate  

00:07:43.180 --> 00:07:50.110
the weights empirically and that leads to 
index construction. So instead of doing  

00:07:50.110 --> 00:07:56.680
this complicated latent variable model that 
possibly has an error term - we just take the  

00:07:56.680 --> 00:08:03.310
indicators and we take a mean or we take 
a sum or we take a weighted sum and we do  

00:08:03.310 --> 00:08:09.100
that before our estimation and we define 
the weights for the index construction  

00:08:09.100 --> 00:08:12.370
based on existing understanding of
the phenomenon and on theory.

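NOTE
A minimal sketch of index construction with weights fixed in advance, as
described above. The function name build_index, the example rows, and the
weights are all hypothetical; in practice the weights would come from
theory or prior understanding of the phenomenon.
```python
def build_index(rows, weights):
    """Weighted sum of indicators using weights fixed before estimation."""
    if any(len(row) != len(weights) for row in rows):
        raise ValueError("each row needs one value per weight")
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]
#
# Hypothetical respondents with three indicator values each.
rows = [[3.0, 4.0, 2.0], [5.0, 1.0, 4.0]]
# Hypothetical weights chosen from theory, not estimated from the data.
theory_weights = [0.5, 0.25, 0.25]
index = build_index(rows, theory_weights)
print(index)  # prints [3.0, 3.75]
```
Because the weights are set before any model is estimated, the index has
the same meaning whatever outcome it is later used to explain.
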
00:08:12.370 --> 00:08:17.470
And I have another video on how
you can actually do that and how  

00:08:17.470 --> 00:08:23.320
you justify index construction. So
that's clearly a good approach, a much

00:08:23.320 --> 00:08:28.180
better approach than trying to specify
these formative latent variable models.