There are two things that we need to consider before we can even start estimating a confirmatory factor analysis model, called scale setting and identification.

Scale setting means that every variable must have a metric: we have to be able to estimate the variance, and sometimes the mean, of every variable. Identification means that the data provide enough information to estimate the model that we want to estimate. The confirmatory factor analysis framework is very flexible, and it is possible to define models that are mathematically impossible to estimate uniquely. So in this video we will go through the requirements you have to consider before you can even estimate a model meaningfully.

Let's take a look at this model with just two indicators. We have indicators a1 and a2, and we want to estimate factor A. We have two error variances and two factor loadings, so we have four things that we want to estimate: four free parameters.

Then we start estimating. We calculate the model-implied correlations. From the data we have two variances - the variance of a1 and the variance of a2 - and one correlation, so we have three unique elements of information that we model using these four parameters.

The problem is that we now have three units of information and four things that we want to estimate. The degrees of freedom is minus one, and such a model cannot be estimated meaningfully. The intuition is that you cannot estimate four things from three things: you have to have at least as much information as the number of things you want to estimate. So this model is not identified, and there are ways to simplify the model so that we can estimate something, or we can add more indicators to make it identified.

So this model is not identified because the degrees of freedom is negative, and factor analysis without additional constraints always requires at least three indicators. A factor analysis of only two indicators is not a very meaningful analysis anyway: you can make it identified, for example by constraining the two factor loadings to be equal, but the estimation wouldn't give you any meaningful information.
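To make the counting concrete, here is a minimal Python sketch (my own illustration, not from the video) of the degrees-of-freedom arithmetic for a one-factor model with p indicators, where the factor itself is given a fixed scale:

```python
def unique_moments(p: int) -> int:
    """Unique elements of a p-by-p correlation matrix:
    p variances plus p*(p-1)/2 correlations, i.e. p*(p+1)/2 in total."""
    return p * (p + 1) // 2

def free_parameters(p: int) -> int:
    """One-factor model with the factor variance fixed:
    p loadings plus p error variances."""
    return 2 * p

for p in (2, 3, 4):
    df = unique_moments(p) - free_parameters(p)
    print(f"{p} indicators: {unique_moments(p)} moments, "
          f"{free_parameters(p)} parameters, df = {df}")

# 2 indicators: 3 moments, 4 parameters, df = -1  (not identified)
# 3 indicators: 6 moments, 6 parameters, df = 0   (just identified)
# 4 indicators: 10 moments, 8 parameters, df = 2  (overidentified)
```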
Let's work more with this example. Assume that the correlation matrix for this model - two factors, each with one indicator - shows that a1 and b1 are correlated at 0.1. That one correlation depends on three parameters: the two factor loadings and the factor correlation. (The indicator variances depend on the model too, but we don't really care about them in this video.)

So why is the correlation between a1 and b1 so low? There are basically three different options. It's possible that a1 and b1 are both highly reliable indicators of factors A and B, and A and B are just weakly correlated. It's possible that A and B are highly correlated but a1 is unreliable, and therefore we observe only a small correlation. Or it's possible that A and B are highly correlated but b1 is unreliable.

The problem is that we cannot know which of these three options is correct, because they all have the same empirical implication: the observed correlation is quite small. This is another example of the identification problem. Here we are estimating five things - two error variances, two factor loadings, and one factor correlation - from just three elements of information. We can't do that. The model is not identified, and we cannot know empirically which of the three explanations is correct.

Of course, we can then use theory to rule out some of these alternative explanations, but that goes beyond the factor analysis estimates and identification. So this model is not identified; it cannot be estimated meaningfully.
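The indistinguishability is easy to verify numerically. Under the model, the implied correlation between the two indicators is the product of the two loadings and the factor correlation. A small sketch with illustrative parameter values (my own, not the video's) shows that each explanation reproduces the same observed correlation of 0.1:

```python
# Implied correlation between a1 and b1 with standardized factors and
# indicators: corr(a1, b1) = loading_a1 * corr(A, B) * loading_b1.
explanations = {
    "both indicators reliable, A and B weakly correlated": (1.0, 0.1, 1.0),
    "A and B highly correlated, a1 unreliable":            (0.125, 0.8, 1.0),
    "A and B highly correlated, b1 unreliable":            (1.0, 0.8, 0.125),
}

for label, (lam_a1, psi_ab, lam_b1) in explanations.items():
    print(f"{label}: implied correlation = {lam_a1 * psi_ab * lam_b1:.2f}")

# All three print 0.10: the data cannot tell the explanations apart.
```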
Let's take a look at scale setting now. Identification, as we saw, basically means that you have at least as much information as what you estimate: the number of unique elements in the correlation matrix of the indicators must equal or exceed the number of free parameters in the model.

Normally, in exploratory factor analysis, we have standardized factors: every factor has a variance of one and a mean of zero, and that defines the scale of these variables. Every variable must have a variance, and in exploratory analysis the factors are scaled to have unit variance - they are standardized - so all the factor loadings are standardized regression coefficients.

What if we don't standardize the factors? Instead of saying that each factor's variance is one, we estimate the factor variances. We add these factor variances to the model, which gives us 15 free parameters. We still have 21 units of information from which to estimate, and we estimate 15 different things, so the degrees of freedom is 6, which means that this model is overidentified. The degrees of freedom is positive, so in principle it is possible to estimate this model meaningfully.

We can do the estimation. Let's say that is our observed correlation matrix and that is our implied correlation matrix. We can then find values for the ψs and the λs such that the implied matrix reproduces the observed correlation matrix perfectly. In this case that is possible because the correlations all have the same values. Generally, in small samples, you will never reproduce the data completely, but in this example you do, just to simplify things.

So we estimate, and that is one set of estimates that gives an exact fit between the observed correlation matrix and the implied correlation matrix. So we're fine, right? It turns out we have a small problem, because there is another set of estimates that also reproduces the correlation matrix perfectly through the implied correlation matrix. You can plug these values into the equations and verify that they produce the exact same implied correlations. In one solution factor A's variance is 1, in the other it is 2, and yet they produce the same fit. So what do we do? We can come up with indefinitely many such examples: if factor A's variance is 0.5, we get yet another set of factor loading values, and still the empirical correlation matrix is reproduced perfectly by the model-implied correlation matrix.
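A quick numerical check of this equivalence (with illustrative loadings, not the video's values): scaling the factor variance up and the loadings down by the matching amount leaves the model-implied matrix untouched.

```python
import numpy as np

lam = np.array([0.8, 0.7, 0.6])   # loadings when the factor variance is 1
theta = np.diag(1 - lam ** 2)     # error variances for unit indicator variances

def implied_cov(lam, psi, theta):
    """Model-implied covariance matrix: Sigma = lam * psi * lam' + Theta."""
    return np.outer(lam, lam) * psi + theta

sigma_a = implied_cov(lam, 1.0, theta)               # factor variance 1
sigma_b = implied_cov(lam / np.sqrt(2), 2.0, theta)  # factor variance 2

print(np.allclose(sigma_a, sigma_b))  # True: identical fit, different parameters
```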
So this is the problem of scale setting of latent variables in confirmatory factor analysis models. We need to set the metric of the factors. Because we don't observe the factors, their scales are arbitrary: we don't know whether they vary from 0 to 1, from 0 to 1 million, or from minus 5 to plus 10. We don't know their ranges, we don't know their variances, and we don't know their means. We have to specify the scale of each factor ourselves.

In exploratory analysis we typically don't model means, and we fix the variances of the factors to ones. In confirmatory analysis there are reasons, which I'll explain a bit later, why we don't fix the variances to ones. The general problem is that we must define whether we are talking about centimetres or inches, about Celsius or Fahrenheit. They quantify exactly the same thing, and from a statistical perspective they are equally good measures of length or temperature; we simply have to agree on which scale we are using.

Similarly, regression gives us the effect of a one-unit change in the independent variable on the dependent variable, so interpreting regression coefficients only makes sense after we have decided how the unit is defined. What is the unit of A, and what is the unit of B? We have to set them manually, so we have to decide on a scale-setting approach.

In exploratory analysis, as I said, we typically say that factor A, factor B, and all other factors have variances of one. That produces standardized factor loadings, which are standardized regression coefficients of the indicators on the factors or, in the case of uncorrelated factors, correlations. We use that in exploratory factor analysis, but we cannot use it in a structural regression model. A structural regression model is an extension of the factor analysis model in which we allow regression relationships between the factors. The reason we can't use this approach there is that the variance of an endogenous variable - a variable that depends on other variables - is a function of the variables it depends on. We can't fix a variable's variance to one if that variance is determined by other things in the model. But that is beyond this video.

Another very common approach is to fix the first indicator's loading to one. This is the default scale-setting approach in most structural regression modelling and confirmatory factor analysis software. The reason is that it can be used pretty much always, regardless of what kinds of variables A and B are and what kind of relationship is specified between them. The idea is that, if we assume classical test theory holds - so all the errors are just random noise - then the variance of A is the variance of the true score of a1. That is appealing: if the only source of error is random noise, the variance of factor A is the variance that a1 would have if it were not contaminated with random noise. It allows us to think in the scale of the indicators, free of error variance, assuming classical test theory holds for the data.

This is such a common approach that there is a rule of thumb that I present: always use the first indicator to fix the scale.
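Here is a sketch of what that convention does, again with made-up numbers: starting from a solution with a standardized factor, dividing the loadings by the first loading and absorbing it into the factor variance gives an equivalent model whose first loading is exactly one.

```python
import numpy as np

lam_std = np.array([0.8, 0.7, 0.6])   # loadings with factor variance fixed to 1
theta = np.diag(1 - lam_std ** 2)     # error variances

# First-indicator parameterization: first loading fixed to 1, factor variance free.
lam_fixed = lam_std / lam_std[0]      # [1.0, 0.875, 0.75]
psi_fixed = lam_std[0] ** 2           # 0.64: the true-score variance of a1
                                      # under classical test theory

sigma_std = np.outer(lam_std, lam_std) * 1.0 + theta
sigma_fixed = np.outer(lam_fixed, lam_fixed) * psi_fixed + theta

print(np.allclose(sigma_std, sigma_fixed))  # True: same implied correlations
```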
We can see that the papers we have used as examples in these videos use this approach. In Mesquita and Lazzarini, you can see that all the loadings of the first indicators are ones. They set the scale of each latent variable by fixing that loading to one, and next to the estimates they report Z-statistics; you can see that the first indicators don't have a Z-statistic. The reason is that those loadings are not estimated from the data - instead, the researcher declares them to be ones. If something is not estimated, it doesn't vary from sample to sample, so it doesn't have a standard error, and we can't calculate a Z-statistic for it.

We can see the same in Yli-Renko's paper. There the first loading is not one, but it has no standard error and no Z-statistic, which is an indication that they fixed the first loading to one to set the scale of the latent variables.

If you want standardized factor loadings - loadings expressed on the scale of an exploratory analysis, where the factor variances are ones - you can rescale the confirmatory factor analysis results afterwards. Your software will produce them for you if you check the standardized estimates option. These are standardized estimates, but the scaling has been done after estimation: you first estimate an unstandardized confirmatory factor analysis in which each factor is scaled by fixing its first indicator's loading, and then you rescale the resulting solution. That is the same approach used for standardized regression coefficients: you first estimate the regression, then you rescale the parameter estimates.
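The rescaling itself is simple arithmetic. A sketch with hypothetical estimates (not taken from either paper) of how an unstandardized solution maps to standardized loadings, which is what the standardized-estimates option computes after estimation:

```python
import numpy as np

lam = np.array([1.0, 0.875, 0.75])    # loadings, first one fixed to 1
psi = 0.64                            # estimated factor variance
theta = np.array([0.36, 0.51, 0.64])  # estimated error variances

# Standardized loading = unstandardized loading * sd(factor) / sd(indicator).
sd_indicators = np.sqrt(lam ** 2 * psi + theta)  # model-implied indicator SDs
lam_standardized = lam * np.sqrt(psi) / sd_indicators

print(lam_standardized)  # [0.8 0.7 0.6]
```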
So, a summary of identification of confirmatory factor analysis models. A model is identified if every latent variable has a scale, if the degrees of freedom is non-negative, and if every part of the model is identified.

In confirmatory factor analysis, once every latent variable - every factor - has a scale, any factor with three indicators is always identified: if you have three indicators, you can always run a factor analysis, no matter what. If you have two indicators, we can either fix both factor loadings to one, which says the two indicators are equally reliable, or we can embed the factor in a larger system. With just two indicators alone, we can't estimate a factor model unless we fix the factor loadings to be the same. If we embed the two-indicator factor into a larger factor analysis, then we can estimate it, because we can use information from the other indicators to estimate those factor loadings.

Then there is the single-indicator rule. If we have a factor with just a single indicator, we cannot estimate the reliability of the indicator, because you cannot estimate reliability based on just one measure. We have to assume the error variance, and typically we do that by constraining it to be zero: we say that factor A, or construct A, is measured without any error, because we can't estimate the error. Of course, we could constrain the error variance to something else. If the indicator has typically been shown to be eighty percent reliable, we can fix the error variance to 20 percent of the indicator's observed variance, so that the factor accounts for the remaining 80 percent, but that is rarely done.

So identification is a requirement for estimation: if our model is not identified, it cannot be meaningfully estimated. Identification basically asks whether you have enough information to estimate the model. If we have one correlation, we can't estimate two different things from it. You need at least one unit of information for everything that you estimate, and ideally you have more, so that there is redundancy. We need a scale for all latent variables, and the degrees of freedom must be non-negative. Ideally it is positive, and the more positive it is, the better our model tests are.
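As a closing illustration of the single-indicator rule above, a sketch (with made-up numbers) of fixing the error variance from an assumed reliability instead of assuming error-free measurement:

```python
observed_variance = 2.5  # sample variance of the single indicator
reliability = 0.80       # assumed from prior research, not estimated

# Classical test theory: observed variance = true-score variance + error
# variance, and reliability is the true-score share. With the loading fixed
# to 1, the factor variance is the reliable part and the fixed error
# variance is the remainder.
factor_variance = reliability * observed_variance             # 2.0
fixed_error_variance = (1 - reliability) * observed_variance  # 0.5

print(factor_variance, fixed_error_variance)
```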