WEBVTT
Kind: captions
Language: en
00:00:00.120 --> 00:00:03.480
There are two things that we need
to consider before we can even
00:00:03.480 --> 00:00:09.660
start estimating a confirmatory factor analysis
model, called scale setting and identification.
00:00:09.660 --> 00:00:14.790
Scale setting means that every variable
must have a metric. So we have to be able to
00:00:14.790 --> 00:00:21.150
estimate the variance and sometimes the mean
of every variable. Identification means
00:00:21.150 --> 00:00:26.460
that the data provides enough information to
estimate the model that we want to estimate.
00:00:26.460 --> 00:00:34.080
So the confirmatory factor analysis framework is
very flexible and it's possible to define models
00:00:34.080 --> 00:00:40.080
that are mathematically impossible to estimate
uniquely. So in this video we will go through
00:00:40.080 --> 00:00:45.870
what requirements you have to consider before
you can even estimate the model meaningfully.
00:00:45.870 --> 00:00:50.520
Let's take a look at this model
with just two indicators. We have
00:00:50.520 --> 00:00:57.420
indicators a1 and a2, and then we want
to estimate factor A. And we have two
00:00:57.420 --> 00:01:01.470
variances - these two error variances here
- and then we have two factor loadings. So
00:01:01.470 --> 00:01:06.990
we have four things that we want to
estimate and so four free parameters.
00:01:06.990 --> 00:01:13.680
Then we start estimating it. We calculate the
model implied correlations. So we have two
00:01:13.680 --> 00:01:20.970
variances - the variance of a2 and the variance of
a1 - and then one correlation. So we have three
00:01:20.970 --> 00:01:28.290
unique elements of information from the data
that we model using these four parameters.
00:01:28.290 --> 00:01:34.260
The problem is that now we have three
units of information and we have four
00:01:34.260 --> 00:01:37.590
things that we want to estimate.
So the degrees of freedom is minus
00:01:37.590 --> 00:01:42.150
one, and that cannot be estimated -
it cannot be estimated meaningfully.
00:01:42.150 --> 00:01:48.300
The reason is that our intuitive
understanding insists that
00:01:48.300 --> 00:01:53.400
you cannot estimate four things from
just three things. So that's the idea.
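The counting rule described here can be sketched in a few lines of Python. The helper below is hypothetical (its name and the parameter counts are mine, not from the video); the information available is the number of unique elements in the indicators' correlation matrix, p(p+1)/2 for p indicators.

```python
# Degrees of freedom of a CFA model: unique elements in the indicators'
# correlation/covariance matrix minus the number of free parameters.
# Hypothetical helper for illustration only.

def cfa_degrees_of_freedom(n_indicators: int, n_free_parameters: int) -> int:
    # p indicators give p variances and p*(p-1)/2 correlations,
    # i.e. p*(p+1)/2 unique elements of information.
    unique_elements = n_indicators * (n_indicators + 1) // 2
    return unique_elements - n_free_parameters

# Two indicators, four free parameters (2 loadings + 2 error variances):
print(cfa_degrees_of_freedom(2, 4))  # -1, so the model is not identified
```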
00:01:53.400 --> 00:02:00.840
You have to have more information than
what you want to estimate. So this is
00:02:00.840 --> 00:02:05.430
not identified and there are ways that
we can simplify the model to actually be
00:02:05.430 --> 00:02:09.330
able to estimate something or we can add
more indicators to make it identified.
00:02:09.330 --> 00:02:16.020
So this is not identified because the degrees
of freedom is negative. And factor analysis
00:02:16.020 --> 00:02:21.570
without additional constraints always
requires at least three indicators.
00:02:21.570 --> 00:02:28.050
Factor analysis of only two indicators is not a
very meaningful analysis anyway, because while you
00:02:28.050 --> 00:02:34.920
can make it identified by saying that these factor
loadings for example are the same - that would
00:02:34.920 --> 00:02:40.980
identify the model - then the estimation wouldn't
give you any meaningful information anyway.
00:02:40.980 --> 00:02:48.600
So let's take another example or work more
with this example. So let's assume that our
00:02:48.600 --> 00:02:56.490
correlation matrix for this two-factor model -
each factor with one indicator - is this: we have a1 and
00:02:56.490 --> 00:03:02.970
b1, correlated at 0.1, and we have
three parameters that we want to estimate.
00:03:02.970 --> 00:03:10.800
So we can't estimate them. We have one correlation that depends
on three parameters, and these other variances
00:03:10.800 --> 00:03:16.860
do depend on the model as well, but
we don't really care about those in this video.
00:03:16.860 --> 00:03:24.030
So why is the correlation between a1 and b1
so low? There are basically three different
00:03:24.030 --> 00:03:30.810
options. It's possible that a1 and b1 are both
highly reliable indicators of these factors A
00:03:30.810 --> 00:03:38.910
and B, and that A and B are just weakly
correlated. It's also possible
00:03:38.910 --> 00:03:45.150
that A and B are highly correlated but a1 is
unreliable and therefore we observe only a
00:03:45.150 --> 00:03:51.000
small correlation or it's possible that A and
B are highly correlated but b1 is unreliable.
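These alternative explanations can be illustrated numerically. All parameter values below are made up; under this model the implied correlation between the two single-indicator factors' indicators is loading times factor correlation times loading (with standardized variables).

```python
# Three made-up parameter sets that all imply (roughly) the same observed
# correlation of about 0.1 between a1 and b1. The implied correlation is
# lambda_a1 * phi_AB * lambda_b1 for standardized variables.
cases = [
    (0.95, 0.111, 0.95),  # both indicators reliable, factors weakly correlated
    (0.125, 0.90, 0.89),  # factors highly correlated, a1 unreliable
    (0.89, 0.90, 0.125),  # factors highly correlated, b1 unreliable
]
for lam_a1, phi_ab, lam_b1 in cases:
    print(round(lam_a1 * phi_ab * lam_b1, 2))  # 0.1 each time
```

The data alone cannot distinguish these cases, which is exactly the non-identification problem described above.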
00:03:51.000 --> 00:03:57.240
The problem is that we cannot know
which of these three options is correct
00:03:57.240 --> 00:04:02.040
because they all have the same empirical
implication which is that this correlation
00:04:02.040 --> 00:04:08.610
here is quite small. So that's another
example of a non-identification problem.
00:04:08.610 --> 00:04:14.700
Here we are estimating five things so we have two
error variances. We have two factor loadings and
00:04:14.700 --> 00:04:20.040
one factor correlation. We are trying to estimate
it from just three elements of information.
00:04:20.040 --> 00:04:23.340
We can't do that. The model is not identified. We
00:04:23.340 --> 00:04:27.600
cannot know which one of these three
explanations is correct empirically.
00:04:27.600 --> 00:04:34.950
Of course we can then rule out some
of these alternative explanations - based on
00:04:34.950 --> 00:04:39.270
theory - but that goes beyond our factor
analysis estimates and identification.
00:04:39.270 --> 00:04:43.650
So this model is not identified. It
cannot be estimated meaningfully.
00:04:43.650 --> 00:04:48.930
Let's take a look at scale setting
now. So the identification basically
00:04:48.930 --> 00:04:54.780
means that you have more information
than what you estimate. So the number
00:04:54.780 --> 00:05:00.750
of unique elements in the correlation matrix
of the indicators must exceed or be the same
00:05:00.750 --> 00:05:03.570
as the number of free parameters
that you estimate in the model.
00:05:03.570 --> 00:05:11.160
Okay. So normally we have - in exploratory factor
analysis we have standardized factors - so the
00:05:11.160 --> 00:05:16.950
idea is that all the factors have variances of
one and means of zero in the exploratory analysis,
00:05:16.950 --> 00:05:23.250
and that defines the scale of these variables. So
every variable has a metric in exploratory
00:05:23.250 --> 00:05:30.030
analysis. The factors are scaled to have unit
variance so they're standardized and then all
00:05:30.030 --> 00:05:34.230
the factor loadings are then standardized
regression coefficients for that reason.
00:05:34.230 --> 00:05:42.000
Then what if we don't standardize the factors?
So instead of saying that
00:05:42.000 --> 00:05:49.230
the factors' variances are one, we estimate
the factors' variances. So we add these factor
00:05:49.230 --> 00:05:53.880
variances here and this factor variance here, so
we have 15 free parameters. We still
00:05:53.880 --> 00:05:59.400
have 21 units of information from which we
estimate but we estimate 15 different things
00:05:59.400 --> 00:06:06.600
so the degrees of freedom is 6 which means
that this model is overidentified. So it's
00:06:06.600 --> 00:06:11.490
positive. So in principle it is possible
to estimate this model meaningfully.
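The counts in this example can be verified directly. This is a sketch; the parameter breakdown below assumes 6 loadings, 6 error variances, 2 factor variances, and 1 factor correlation, matching the 15 free parameters mentioned.

```python
# Two factors, three indicators each: count information vs. parameters.
n_indicators = 6
unique_elements = n_indicators * (n_indicators + 1) // 2  # 21 units of information
# 6 loadings + 6 error variances + 2 factor variances + 1 factor correlation:
n_free = 6 + 6 + 2 + 1  # 15 free parameters
print(unique_elements, n_free, unique_elements - n_free)  # 21 15 6
```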
00:06:11.490 --> 00:06:18.030
We can do the estimation. So let's assume
that that's our observed correlation matrix.
00:06:18.030 --> 00:06:24.240
That's our implied correlation matrix.
Then we can find the values for the psis
00:06:24.240 --> 00:06:32.610
and the lambdas so that this implied matrix
reproduces this correlation matrix perfectly.
00:06:32.610 --> 00:06:38.370
In this case that's possible because
these correlations all have the same
00:06:38.370 --> 00:06:41.580
values. Generally in small samples you will never
00:06:41.580 --> 00:06:45.960
completely reproduce the data but in this
example you do just to simplify things.
00:06:45.960 --> 00:06:54.570
So we can estimate and that's one set of
estimates that will give you the exact fit
00:06:54.570 --> 00:06:58.710
between the observed correlation
matrix and the implied correlation matrix.
00:06:58.710 --> 00:07:09.270
So we're fine, right? It turns out we have a small
problem, because there's another set of estimates
00:07:09.270 --> 00:07:14.730
that also reproduces the correlation matrix
perfectly through the implied correlation
00:07:14.730 --> 00:07:22.080
matrix. So you can plug in these values to
the equations and see that they produce the
00:07:22.080 --> 00:07:28.890
exact same implied correlations. So we have
here factor A's variance is 1 versus factor
00:07:28.890 --> 00:07:36.450
B's variance of 2, and therefore they
produce the same fit. So what do we
00:07:36.450 --> 00:07:43.590
do? We can come up with infinitely
many examples. So if factor A's variance is
00:07:43.590 --> 00:07:51.390
0.5, then we will have different values
for the factor loadings, but still the empirical
00:07:51.390 --> 00:07:56.130
correlation matrix is reproduced perfectly
using the model implied correlation matrix.
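This indeterminacy can be demonstrated with a short calculation (the loading values below are made up): scaling the factor variance by any constant c and dividing the loadings by the square root of c leaves every model-implied covariance unchanged.

```python
import numpy as np

# Made-up loadings and factor variance for a one-factor block.
lam = np.array([0.8, 0.7, 0.6])      # factor loadings (hypothetical values)
phi = 1.0                            # factor variance
implied = np.outer(lam, lam) * phi   # implied covariances among indicators

# Rescale: double the factor variance, shrink the loadings by sqrt(2).
c = 2.0
implied_rescaled = np.outer(lam / np.sqrt(c), lam / np.sqrt(c)) * (phi * c)

print(np.allclose(implied, implied_rescaled))  # True: same fit, different scale
```

Since the fit is identical for every choice of c, the data cannot pin down the factor's variance, and we must fix the scale ourselves.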
00:07:56.130 --> 00:08:03.420
So this is the problem of scale setting for latent
variables in confirmatory factor analysis models.
00:08:03.420 --> 00:08:12.810
So we need to set the metric for the factors
themselves. Because we don't observe the factors,
00:08:12.810 --> 00:08:18.780
they are just arbitrary quantities: we don't know
whether they vary from 0 to 1, or 0 to 1 million,
00:08:18.780 --> 00:08:25.350
or minus 5 to plus 10 or whatever. We don't know
their range. We don't know their variances. We
00:08:25.350 --> 00:08:32.310
don't know their means. We have to specify the
scale of each latent variable, each factor, ourselves.
00:08:32.310 --> 00:08:37.590
In exploratory analysis we typically
don't model means and then we assume
00:08:37.590 --> 00:08:41.670
that the variances are ones - or rather, we fix
the variances of the factors to be ones.
00:08:41.670 --> 00:08:52.650
In confirmatory analysis there are reasons why
we don't fix the variances to ones, which I'll
00:08:52.650 --> 00:09:03.390
explain a bit later. But the problem generally
is that we must define whether we are talking
00:09:03.390 --> 00:09:10.980
about centimetres or inches - do we talk about
Celsius or Fahrenheit. They quantify the same
00:09:10.980 --> 00:09:19.140
exact thing, and from a statistical perspective
they are equally good measures of
00:09:19.140 --> 00:09:23.910
length or temperature. We have to agree
on what scale we're using.
00:09:23.910 --> 00:09:32.580
So also a regression gives us the one
unit change - the effect of one unit
00:09:32.580 --> 00:09:37.050
of change in the independent variable
on the dependent variable. Considering
00:09:37.050 --> 00:09:42.540
regression coefficients only makes
sense after we have considered how
00:09:42.540 --> 00:09:48.120
we define the unit: what is the
unit of A and what is the unit of B?
00:09:48.120 --> 00:09:55.440
We have to set them manually. So we have to
decide on a scale setting approach. In exploratory
00:09:55.440 --> 00:10:01.800
analysis, as I said, we typically say that factor
A and factor B - all factors - have variances of
00:10:01.800 --> 00:10:08.550
one. That produces standardized factor loadings,
which are standardized regression coefficients
00:10:08.550 --> 00:10:16.110
of the indicators on the factors or in the case
of uncorrelated factors they equal correlations.
00:10:16.110 --> 00:10:22.830
We use that in exploratory factor analysis.
We cannot use that in a structural regression
00:10:22.830 --> 00:10:26.880
model. A structural regression model
is an extension of a factor analysis
00:10:26.880 --> 00:10:30.810
model where we allow regression
relationships between the factors.
00:10:30.810 --> 00:10:36.090
The reason why we can't use this
approach is that the variation of
00:10:36.090 --> 00:10:41.370
an endogenous variable - so a variable that
depends on other variables - is determined by
00:10:41.370 --> 00:10:47.070
those other variables. So we can't say the
variable's variance is one if that variance
00:10:47.070 --> 00:10:51.300
depends on other things in the model.
But that's beyond this video.
00:10:51.300 --> 00:11:00.570
Another very common approach is that
we fix the first
00:11:00.570 --> 00:11:05.610
indicator's loading to be one. And this
is the default scale setting approach in
00:11:05.610 --> 00:11:10.290
most structural regression modelling or
confirmatory factor analysis software.
00:11:10.290 --> 00:11:16.830
The reason is that this can be used pretty much
always regardless of what kind of variables
00:11:16.830 --> 00:11:21.420
we have here as A and B, and what kind of
relationship we specify between A
00:11:21.420 --> 00:11:29.370
and B. And the idea is this:
if we assume that classical test theory
00:11:29.370 --> 00:11:36.420
holds - so all these errors here are just
random noise - then the variance of A is
00:11:36.420 --> 00:11:43.950
whatever is the variance of the true score of
a1. So that's also appealing if we consider
00:11:43.950 --> 00:11:51.660
that the only source of error is random noise -
then the variance of factor A is the variation
00:11:51.660 --> 00:11:57.420
of a1, or what the variance of a1 would be if it
wasn't contaminated with this random noise here.
00:11:57.420 --> 00:12:05.520
So that's also one reason why
this is appealing. It allows us to consider
00:12:05.520 --> 00:12:14.730
the scale of these indicators free of error variance,
assuming classical test theory holds for the data.
00:12:14.730 --> 00:12:21.150
And this is such a common approach
that there's a rule of thumb that I
00:12:21.150 --> 00:12:24.900
present: always use the first
indicator to fix the scale.
00:12:24.900 --> 00:12:32.670
We can see that the papers - that we have used
as examples in these videos - are using this
00:12:32.670 --> 00:12:39.840
approach. Mesquita and Lazzarini - you can see all
loadings of first indicators are ones. So they set
00:12:39.840 --> 00:12:46.830
the scale of the latent variable by fixing this
loading to one and then they have the Z-statistic
00:12:46.830 --> 00:12:54.060
here and you can see that the indicators - the
first indicators - don't have a Z-statistic.
00:12:54.060 --> 00:12:59.910
The reason is that they are not estimated from
the data - instead the researcher says that these
00:12:59.910 --> 00:13:05.730
are ones. They are not estimated, and if something
is not estimated, it doesn't vary from sample
00:13:05.730 --> 00:13:12.840
to sample. So it doesn't have a standard error,
and we can't calculate the Z-statistic for it.
00:13:12.840 --> 00:13:20.760
We can see the same in Yli-Renko's paper. So
Yli-Renko's paper - the first loading is not
00:13:20.760 --> 00:13:27.000
one, but it doesn't have a Z-statistic
and it doesn't have
00:13:27.000 --> 00:13:33.090
a standard error, so that's an indication that they
actually fixed the first loading to be one to
00:13:33.090 --> 00:13:39.750
identify or the scale the latent variables. If
you want to have standardized factor loadings
00:13:39.750 --> 00:13:47.160
so if you want to have loadings that are
expressed in the scale of the exploratory
00:13:47.160 --> 00:13:53.880
analysis where the factor variances are ones
then you can rescale the confirmatory factor
00:13:53.880 --> 00:14:00.420
analysis results afterwards. Your software
will produce that for you if you check the
00:14:00.420 --> 00:14:05.520
standardized estimates option there. So
these are standardized estimates but the
00:14:05.520 --> 00:14:10.320
scaling has been done after estimation.
So you first estimate an unstandardized
00:14:10.320 --> 00:14:18.690
confirmatory factor analysis where each factor
is scaled by fixing the first indicator's loading - then
00:14:18.690 --> 00:14:25.320
you scale the resulting solution. That's the same
approach that you use for standardized regression
00:14:25.320 --> 00:14:30.300
coefficients. You first estimate the regression,
then you scale the parameter estimates later.
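The after-estimation rescaling works like standardizing a regression coefficient. The numbers below are made up; a standardized loading is the unstandardized loading times the factor's standard deviation, divided by the indicator's standard deviation.

```python
import math

# Made-up unstandardized CFA results for one indicator.
lam_unstd = 1.0        # first loading, fixed to one for scale setting
factor_var = 0.64      # estimated factor variance (illustrative value)
indicator_var = 1.0    # observed variance of the indicator (illustrative)

# Standardized loading: express the loading in standard-deviation units.
lam_std = lam_unstd * math.sqrt(factor_var) / math.sqrt(indicator_var)
print(round(lam_std, 3))  # 0.8
```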
00:14:30.300 --> 00:14:36.270
So, to summarize identification of
confirmatory factor analysis models:
00:14:36.270 --> 00:14:44.730
A model is identified if every latent variable
has a scale and if the degrees of freedom is
00:14:44.730 --> 00:14:50.940
non-negative, and also every part
of the model has to be identified.
00:14:50.940 --> 00:14:57.870
In confirmatory factor analysis - after we have
established that every latent variable, every factor, has
00:14:57.870 --> 00:15:05.130
a scale - then all factors with three indicators
are always identified. So, three indicators - if you
00:15:05.130 --> 00:15:10.710
have three variables you can always run a factor
analysis no matter what. Then if you have two
00:15:10.710 --> 00:15:18.270
indicators - then we can either say that
both are equally reliable, so we fix the factor
00:15:18.270 --> 00:15:27.690
loadings to be ones - or we can embed this factor
in a larger system. So with just two variables alone
00:15:27.690 --> 00:15:35.040
we can't estimate a factor model unless we fix
these factor loadings to be the same. If we embed
00:15:35.040 --> 00:15:42.540
this factor - the two-indicator factor
- into a larger factor analysis, then we can
00:15:42.540 --> 00:15:47.940
estimate it, because we can use information from other
indicators to estimate these factor loadings.
00:15:47.940 --> 00:15:52.140
And then the single indicator rule - if
we have a factor with just a single
00:15:52.140 --> 00:15:56.280
indicator then we cannot estimate the
reliability of the indicator because
00:15:56.280 --> 00:16:00.150
you cannot estimate reliability based
on just one measure. That's the idea.
00:16:00.150 --> 00:16:07.770
We have to assume the error variance, and
typically we do that by constraining the error
00:16:07.770 --> 00:16:14.370
variance to be zero. So we say that this factor A
or construct A is measured without any error if we
00:16:14.370 --> 00:16:19.470
can't estimate it. Of course we could constrain
the error variance to be something else. If we
00:16:19.470 --> 00:16:26.610
know that the indicator has typically been shown to
be eighty percent reliable - then we can fix the error
00:16:26.610 --> 00:16:33.060
variance here to be 20 percent of the observed
variance of the indicator, but that's rarely done.
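The arithmetic of that fix is simple (the numbers below are illustrative): with an assumed reliability of 0.8, the error variance is the remaining share of the indicator's observed variance.

```python
# Single-indicator factor: reliability must be assumed, not estimated.
reliability = 0.8        # assumed, e.g. from prior studies (illustrative)
observed_variance = 2.5  # observed variance of the indicator (made up)

# Reliability is the true-score share, so the error is the remainder.
error_variance = (1 - reliability) * observed_variance
print(round(error_variance, 6))  # 0.5
```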
00:16:33.060 --> 00:16:37.290
So identification is a requirement for estimation.
00:16:37.290 --> 00:16:41.040
If our model is not identified it
cannot be meaningfully estimated.
00:16:41.040 --> 00:16:45.870
Identification basically asks: do
you have enough information to estimate
00:16:45.870 --> 00:16:51.360
the model. If we have one correlation
we can't estimate two different things
00:16:51.360 --> 00:16:56.460
from one correlation. You need at
least one unit of information for
00:16:56.460 --> 00:17:01.170
everything that you estimate; ideally you
have more information, so there is redundancy.
00:17:01.170 --> 00:17:07.620
So we need to have a scale for all latent
variables, and the degrees of freedom must be
00:17:07.620 --> 00:17:14.400
non-negative. Ideally it is positive, and the more
positive it is, the better our model tests are.