WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:02.700
The biggest problem in the
formative measurement idea is
00:00:02.700 --> 00:00:05.190
the idea that the indicators cause the construct.
00:00:05.190 --> 00:00:10.710
There are also statistical issues in
how these models are specified and how
00:00:10.710 --> 00:00:14.940
particularly the models are identified. I
will explain a couple of these issues in
00:00:14.940 --> 00:00:19.290
this video. There are a couple more but they
are not as important as these two issues.
00:00:19.290 --> 00:00:27.540
The root of the problem is that a formative model
- where we specify this latent variable as a
00:00:27.540 --> 00:00:33.390
function of these observed variables, three in
this example, and this unobserved error term - is
00:00:33.390 --> 00:00:39.030
not identified in itself. It's like a regression
analysis without the dependent variable basically.
00:00:39.030 --> 00:00:45.270
It's not identified because these correlations
within these three indicators are free and that
00:00:45.270 --> 00:00:50.820
consumes all degrees of freedom and we don't
have any more information for estimating
00:00:50.820 --> 00:00:55.230
these paths or the variance of the error
term. So the degrees of freedom are negative.
00:00:55.230 --> 00:01:02.910
There are a couple of ways around this problem.
The most commonly recommended way is that we add
00:01:02.910 --> 00:01:09.180
two normal indicators. The literature on formative
measurement calls these reflective indicators.
00:01:09.180 --> 00:01:16.470
So we specify that this latent variable
here actually is a common factor for these
00:01:16.470 --> 00:01:22.050
two measurements and these measurements are
added there for identification of the model.
00:01:22.050 --> 00:01:29.310
So this leads to an interesting problem
and the problem is actually that this
00:01:29.310 --> 00:01:37.320
latent variable here is now defined by these
two normal indicators instead of these three
00:01:37.320 --> 00:01:43.890
formative or causal indicators. So these
indicators - measure
00:01:43.890 --> 00:01:49.320
one and measure two - actually give this
latent variable its identity and meaning.
00:01:49.320 --> 00:01:55.440
So I've written a couple of papers about this
topic but the problem essentially is that
00:01:55.440 --> 00:02:03.210
if these causal or formative indicators
are not valid measures of this latent
00:02:03.210 --> 00:02:09.990
variable - but these indicators are
- then these weights or regression
00:02:09.990 --> 00:02:15.570
coefficients here will simply be estimated
as 0. So we have a normal latent variable
00:02:15.570 --> 00:02:20.730
measured with two indicators and then we
have three unrelated indicators that don't
00:02:20.730 --> 00:02:26.250
really have any relationship with the latent
variable defined by these two variables here.
00:02:26.250 --> 00:02:32.340
So that's one problem. Another way of
thinking about this is that if we have
00:02:32.340 --> 00:02:39.030
these two indicators here that measure the
latent variable then these three indicators
00:02:39.030 --> 00:02:45.660
here at the bottom are - you don't need them.
So you can just define the model and measure
00:02:45.660 --> 00:02:49.410
it normally with these two indicators
and there are no problems with that.
00:02:49.410 --> 00:02:57.690
And that of course doesn't go well with the idea
that some concepts must be measured with these
00:02:57.690 --> 00:03:05.460
formative indicators. So that's one problem and
what's the cause of this phenomenon - that the
00:03:05.460 --> 00:03:12.180
meaning of this latent variable comes from these
two measures instead of these three measures - is
00:03:12.180 --> 00:03:19.620
that we have this error term here and the
error term guarantees that whatever these
00:03:19.620 --> 00:03:26.940
indicators represent then this error term will
make - because it's unrelated to these three
00:03:26.940 --> 00:03:32.580
indicators here - the latent variable
a common factor of these two indicators.
00:03:32.580 --> 00:03:38.730
So if these three indicators are
conceptually unrelated to whatever
00:03:38.730 --> 00:03:44.460
these two indicators represent then the
error term here will compensate for that
00:03:44.460 --> 00:03:49.860
and we are basically just modeling the
error term with these three indicators
00:03:50.460 --> 00:03:54.780
instead of whatever we think that
these causal indicators here cause.
00:03:54.780 --> 00:04:02.280
So that's one problem and how do we deal with
that problem? We can of course eliminate
00:04:02.280 --> 00:04:06.660
that problem by eliminating the error term
from the model. But that leads
00:04:06.660 --> 00:04:11.280
to another problem. So let's consider this
kind of model. So here this is not a latent
00:04:11.280 --> 00:04:17.730
variable anymore because this formative latent
variable is actually just a weighted sum of these
00:04:17.730 --> 00:04:21.870
indicators. There's no error and this is like
a regression analysis without an error term.
00:04:21.870 --> 00:04:27.150
Then how do we set these different
weights? So we create an index based
00:04:27.150 --> 00:04:31.980
on three different indicators. We
set these weights. The normal way
00:04:31.980 --> 00:04:38.100
of defining or specifying this
kind of model is that we have this latent
00:04:38.100 --> 00:04:43.170
variable here without an error term and then
we have another latent variable that we want
00:04:43.170 --> 00:04:47.700
to explain with this latent variable
and we have a regression relationship.
00:04:47.700 --> 00:04:55.110
Specifying a model like that defines these
weights so that they maximize this path.
00:04:55.110 --> 00:05:03.000
And is that problematic or not? Well it
is problematic because if we want to test
00:05:03.000 --> 00:05:10.110
for example whether this beta here is zero or
not whether the beta has an effect whether this
00:05:10.110 --> 00:05:16.350
formative LV has an effect on this other latent
variable then setting these weights so that the
00:05:16.350 --> 00:05:21.570
beta is as large as possible is probably the
worst possible way that you can create an index.
00:05:21.570 --> 00:05:28.080
So if you want to test if something exists
then trying to use any correlations in
00:05:28.080 --> 00:05:33.060
your data to make your estimate as large as
possible is not a good estimation principle.
00:05:33.060 --> 00:05:41.010
So there's a possible positive bias. There
is also another problem: if we
00:05:41.010 --> 00:05:46.740
set these weights so that this beta is
as large as possible then the weights
00:05:46.740 --> 00:05:54.810
actually depend on whatever this other
latent variable is and this leads to a
00:05:54.810 --> 00:05:58.230
problem called interpretational
confounding in this literature.
00:05:58.230 --> 00:06:03.180
So the meaning of this latent variable
here - that is supposed to be caused
00:06:03.180 --> 00:06:08.370
by these three formative indicators -
actually depends on which other
00:06:08.370 --> 00:06:13.110
latent variable or other variables we
have in the model. And that's undesirable.
00:06:13.110 --> 00:06:20.520
So if you think about the stock index.
Would it make sense that the stock index
00:06:20.520 --> 00:06:25.260
would be different depending on who is
using the index? I don't think so. It
00:06:25.260 --> 00:06:30.780
should be the same. So the meaning of the
index should be the same across studies which
00:06:30.780 --> 00:06:35.670
means that these indicators - these
weights - also must stay the same.
00:06:35.670 --> 00:06:42.580
Then there's also the assumption that if these
indicators here have any effect on this other
00:06:42.580 --> 00:06:48.970
latent variable - then they must be fully
mediated by this formative latent variable.
00:06:48.970 --> 00:06:54.520
So let's consider socioeconomic
status. So that's our formative
00:06:54.520 --> 00:07:02.440
latent variable. One of the indicators
is your education and then we
00:07:02.440 --> 00:07:08.560
want to explain child's education
with parents' socioeconomic status.
00:07:08.560 --> 00:07:15.250
Is it reasonable to assume that the
parents' education has no other causal
00:07:15.250 --> 00:07:21.070
effect on child's education than through the
full mediation through socioeconomic status?
00:07:21.070 --> 00:07:27.460
That is clearly unreasonable. So that full
mediation assumption here is also unreasonable.
00:07:27.460 --> 00:07:37.030
So what's the alternative? The solution is to
define these weights based on theories. So you
00:07:37.030 --> 00:07:43.180
set the weights based on your understanding of
the phenomenon instead of trying to estimate
00:07:43.180 --> 00:07:50.110
the weights empirically and that leads to
index construction. So instead of doing
00:07:50.110 --> 00:07:56.680
this complicated latent variable model that
possibly has an error term - we just take the
00:07:56.680 --> 00:08:03.310
indicators and we take a mean or we take
a sum or we take a weighted sum and we do
00:08:03.310 --> 00:08:09.100
that before our estimation and we define
the weights for the index construction
00:08:09.100 --> 00:08:12.370
based on existing understanding of
the phenomenon and on the theory.
00:08:12.370 --> 00:08:17.470
And I have another video of how
you can actually do that and how
00:08:17.470 --> 00:08:23.320
you justify index construction. So
that's clearly a good approach. A lot
00:08:23.320 --> 00:08:28.180
better approach than trying to specify
these formative latent variable models.