WEBVTT
Kind: captions
Language: en

00:00:00.120 --> 00:00:03.480
There are two things that we need 
to consider before we can even  

00:00:03.480 --> 00:00:09.660
start estimating a confirmatory factor analysis
model, called scale setting and identification.

00:00:09.660 --> 00:00:14.790
Scale setting means that every variable
must have a metric. So we have to be able to

00:00:14.790 --> 00:00:21.150
estimate the variance and sometimes the mean
of every variable. Identification means

00:00:21.150 --> 00:00:26.460
that the data provides enough information to 
estimate the model that we want to estimate.

00:00:26.460 --> 00:00:34.080
So the confirmatory factor analysis framework is 
very flexible and it's possible to define models  

00:00:34.080 --> 00:00:40.080
that are mathematically impossible to estimate 
uniquely. So in this video we will go through  

00:00:40.080 --> 00:00:45.870
what requirements you have to consider before 
you can even estimate the model meaningfully.

00:00:45.870 --> 00:00:50.520
Let's take a look at this model 
with just two indicators. We have  

00:00:50.520 --> 00:00:57.420
indicator a1 and a2 and then we want 
to estimate factor A. And we have two  

00:00:57.420 --> 00:01:01.470
variances - these two error variances here 
- and then we have two factor loadings. So  

00:01:01.470 --> 00:01:06.990
we have four things that we want to 
estimate and so four free parameters.

00:01:06.990 --> 00:01:13.680
Then we start estimating it. We calculate the 
model implied correlations. So we have two  

00:01:13.680 --> 00:01:20.970
variances - the variance of a1 and the variance
of a2 - and then one correlation. So we have three

00:01:20.970 --> 00:01:28.290
unique elements of information from the data 
that we model using these four parameters.

00:01:28.290 --> 00:01:34.260
The problem is that now we have three 
units of information and we have four  

00:01:34.260 --> 00:01:37.590
things that we want to estimate. 
So the degrees of freedom is minus  

00:01:37.590 --> 00:01:42.150
one, and that cannot be estimated, or at
least not estimated meaningfully.

00:01:42.150 --> 00:01:48.300
The reason - or the intuitive
understanding - is that

00:01:48.300 --> 00:01:53.400
you cannot estimate four things
from three things. So that's the idea.

00:01:53.400 --> 00:02:00.840
You have to have more information than 
what you want to estimate. So this is  

00:02:00.840 --> 00:02:05.430
not identified and there are ways that 
we can simplify the model to actually be  

00:02:05.430 --> 00:02:09.330
able to estimate something or we can add 
more indicators to make it identified.

00:02:09.330 --> 00:02:16.020
So this is not identified because the degrees 
of freedom is negative. And factor analysis  

00:02:16.020 --> 00:02:21.570
without additional constraints always 
requires at least three indicators.

00:02:21.570 --> 00:02:28.050
Factor analysis with only two indicators is not a
very meaningful analysis anyway, because while you

00:02:28.050 --> 00:02:34.920
can make it identified by saying that these factor 
loadings for example are the same - that would  

00:02:34.920 --> 00:02:40.980
identify the model - then the estimation wouldn't 
give you any meaningful information anyway.

00:02:40.980 --> 00:02:48.600
So let's take another example or work more 
with this example. So let's assume that our  

00:02:48.600 --> 00:02:56.490
correlation matrix for this two-factor model,
each factor with one indicator, is - so we have a1 and

00:02:56.490 --> 00:03:02.970
b1, and they're correlated at 0.1. And we have
three parameters that we want to estimate.

00:03:02.970 --> 00:03:10.800
So you can't estimate that. We have one correlation that
depends on three parameters, and these other variances

00:03:10.800 --> 00:03:16.860
don't depend on the model or they do depend but 
we don't really care about those in this video.

00:03:16.860 --> 00:03:24.030
So why is the correlation between a1 and b1
so low? There are basically three different

00:03:24.030 --> 00:03:30.810
options. It's possible that a1 and b1 are both
highly reliable indicators of these factors A

00:03:30.810 --> 00:03:38.910
and B, and that A and B are just weakly
correlated. It's also possible

00:03:38.910 --> 00:03:45.150
that A and B are highly correlated but a1 is 
unreliable and therefore we observe only a  

00:03:45.150 --> 00:03:51.000
small correlation or it's possible that A and 
B are highly correlated but b1 is unreliable.

00:03:51.000 --> 00:03:57.240
The problem is that we cannot know 
which of these three options is correct  

00:03:57.240 --> 00:04:02.040
because they all have the same empirical 
implication which is that this correlation  

00:04:02.040 --> 00:04:08.610
here is quite small. So that's another
example of a non-identification problem.
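
The observational equivalence can be checked directly. In a standardized model with one indicator per factor, the implied correlation of a1 and b1 is the product of the two loadings and the factor correlation; the numbers below are hypothetical, chosen only so each story reproduces 0.1.

```python
def implied_correlation(loading_a1, corr_ab, loading_b1):
    # Standardized one-indicator-per-factor model:
    # corr(a1, b1) = loading(a1) * corr(A, B) * loading(b1)
    return loading_a1 * corr_ab * loading_b1

# Three observationally equivalent stories behind corr(a1, b1) = 0.1
weak_factors = implied_correlation(1.0, 0.1, 1.0)    # reliable indicators, weak A-B correlation
noisy_a1     = implied_correlation(0.125, 0.8, 1.0)  # strong A-B correlation, unreliable a1
noisy_b1     = implied_correlation(1.0, 0.8, 0.125)  # strong A-B correlation, unreliable b1
```

All three parameter sets imply the same 0.1, so the data alone cannot choose between them.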

00:04:08.610 --> 00:04:14.700
Here we are estimating five things so we have two 
error variances. We have two factor loadings and  

00:04:14.700 --> 00:04:20.040
one factor correlation. And we are trying to estimate
them from just three elements of information.

00:04:20.040 --> 00:04:23.340
We can't do that. The model is not identified. We  

00:04:23.340 --> 00:04:27.600
cannot know which one of these three 
explanations is correct empirically.

00:04:27.600 --> 00:04:34.950
Of course we can then rule out some of
these alternative explanations - based on

00:04:34.950 --> 00:04:39.270
theory - but that goes beyond our factor 
analysis estimates and identification.

00:04:39.270 --> 00:04:43.650
So this model is not identified. It 
cannot be estimated meaningfully.

00:04:43.650 --> 00:04:48.930
Let's take a look at scale setting
now. So identification basically

00:04:48.930 --> 00:04:54.780
means that you have more information 
than what you estimate. So the number  

00:04:54.780 --> 00:05:00.750
of unique elements in the correlation matrix 
of the indicators must exceed or be the same  

00:05:00.750 --> 00:05:03.570
as the number of free parameters
that you estimate from the model.

00:05:03.570 --> 00:05:11.160
Okay. So normally we have - in exploratory factor 
analysis we have standardized factors - so the  

00:05:11.160 --> 00:05:16.950
idea is that all the factors have variances of
one and means of zero in the exploratory analysis,

00:05:16.950 --> 00:05:23.250
and that defines the scale of these variables. So 
every variable must have a variance in exploratory  

00:05:23.250 --> 00:05:30.030
analysis. The factors are scaled to have unit 
variance so they're standardized and then all  

00:05:30.030 --> 00:05:34.230
the factor loadings are then standardized 
regression coefficients for that reason.

00:05:34.230 --> 00:05:42.000
Then what if we don't standardize the factors?
So instead of saying that

00:05:42.000 --> 00:05:49.230
the factor variances are one, we estimate
the factor variances. So we add these factor

00:05:49.230 --> 00:05:53.880
variances here and here, so
we have 15 free parameters. We still

00:05:53.880 --> 00:05:59.400
have 21 units of information from which we 
estimate but we estimate 15 different things  

00:05:59.400 --> 00:06:06.600
so the degrees of freedom is 6 which means 
that this model is overidentified. So it's  

00:06:06.600 --> 00:06:11.490
positive. So in principle it is possible 
to estimate this model meaningfully.
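
The counting for this model can be sketched as follows. The breakdown of the 15 parameters (six loadings, six error variances, two factor variances, one factor correlation) is one plausible reading of the model described here.

```python
p = 6                      # six indicators, three per factor
unique = p * (p + 1) // 2  # unique elements in the correlation matrix: 21
free = 6 + 6 + 2 + 1       # loadings + error variances + factor variances + factor correlation
df = unique - free         # positive, so overidentified on this count
```

A positive degrees of freedom is necessary but, as the next example shows, not sufficient: the scale of each factor must still be set.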

00:06:11.490 --> 00:06:18.030
We can do the estimation. So let's assume 
that that's our observed correlation matrix.  

00:06:18.030 --> 00:06:24.240
That's our implied correlation matrix. 
Then we can find the values for the psis

00:06:24.240 --> 00:06:32.610
and the lambdas, so that this implied matrix
reproduces this correlation matrix perfectly.

00:06:32.610 --> 00:06:38.370
In this case that's possible because 
these correlations all have the same  

00:06:38.370 --> 00:06:41.580
values. Generally in small samples you will never  

00:06:41.580 --> 00:06:45.960
completely reproduce the data but in this 
example you do just to simplify things.

00:06:45.960 --> 00:06:54.570
So we can estimate and that's one set of 
estimates that will give you the exact fit  

00:06:54.570 --> 00:06:58.710
between the observed correlation
matrix and the implied correlation matrix.

00:06:58.710 --> 00:07:09.270
So we're fine, right? Turns out we have a small
problem, because there's another set of estimates

00:07:09.270 --> 00:07:14.730
that also reproduces the correlation matrix
perfectly through the implied correlation

00:07:14.730 --> 00:07:22.080
matrix. So you can plug in these values to 
the equations and see that they produce the  

00:07:22.080 --> 00:07:28.890
exact same implied correlations. So we have
here factor A's variance of 1 versus factor

00:07:28.890 --> 00:07:36.450
A's variance of 2, and they
produce the same fit. So what do we

00:07:36.450 --> 00:07:43.590
do? We can come up with infinitely
many examples. So if factor A's variance is

00:07:43.590 --> 00:07:51.390
0.5, then we get yet another set of factor
loading values, but still the empirical

00:07:51.390 --> 00:07:56.130
correlation matrix is reproduced perfectly 
using the model implied correlation matrix.
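
This trade-off between factor variance and loadings can be verified numerically. All numbers below are hypothetical: a one-factor, three-indicator model whose implied covariance is the factor variance times the outer product of the loadings, plus the error variances.

```python
import numpy as np

# One factor, three indicators; hypothetical parameter values
loadings = np.array([0.8, 0.7, 0.6])
psi = 1.0                             # factor variance
theta = np.diag([0.36, 0.51, 0.64])  # error variances

implied_1 = psi * np.outer(loadings, loadings) + theta

# Rescale the factor: double its variance, shrink every loading by sqrt(2)
c = 2.0
implied_2 = (c * psi) * np.outer(loadings / np.sqrt(c),
                                 loadings / np.sqrt(c)) + theta

# Both parameter sets imply exactly the same matrix
equivalent = np.allclose(implied_1, implied_2)
```

Any rescaling of the factor variance can be absorbed by the loadings, which is exactly why the data alone cannot pin down the scale.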

00:07:56.130 --> 00:08:03.420
So this is the problem of scale setting of latent
variables in confirmatory factor analysis models.

00:08:03.420 --> 00:08:12.810
So we need to set the metric ourselves. Because
we don't observe the factors,

00:08:12.810 --> 00:08:18.780
they are just arbitrary quantities - we don't know
whether they vary from 0 to 1, or 0 to 1 million,

00:08:18.780 --> 00:08:25.350
or minus 5 to plus 10 or whatever. We don't know 
their range. We don't know their variances. We  

00:08:25.350 --> 00:08:32.310
don't know their means. We have to specify the
scale of each latent variable, each factor, ourselves.

00:08:32.310 --> 00:08:37.590
In exploratory analysis we typically 
don't model means and then we assume  

00:08:37.590 --> 00:08:41.670
that the variances are ones - or rather, we fix
the variances of the factors to be ones.

00:08:41.670 --> 00:08:52.650
In confirmatory analysis there are reasons why
we don't fix the variances to ones. That I'll

00:08:52.650 --> 00:09:03.390
explain a bit later. But the problem generally 
is that we must define whether we are talking  

00:09:03.390 --> 00:09:10.980
about centimetres or inches - do we talk about 
Celsius or Fahrenheit. They quantify the same  

00:09:10.980 --> 00:09:19.140
exact thing, and from a statistical perspective
they are equally good measures of

00:09:19.140 --> 00:09:23.910
length or temperature. We have to agree
on what scale we are using.

00:09:23.910 --> 00:09:32.580
Also, a regression coefficient gives us
the effect of a one-unit change

00:09:32.580 --> 00:09:37.050
in the independent variable
on the dependent variable - so considering

00:09:37.050 --> 00:09:42.540
regression coefficients only makes 
sense after we have considered how  

00:09:42.540 --> 00:09:48.120
we define the unit. So what is the 
unit of A and what is the unit of B.

00:09:48.120 --> 00:09:55.440
We have to set them manually. So we have to 
decide a scale setting approach. In exploratory  

00:09:55.440 --> 00:10:01.800
analysis, as I said, we typically say that factor
A and factor B - all factors - have variances of

00:10:01.800 --> 00:10:08.550
one. That produces standardized factor loadings,
which are standardized regression coefficients

00:10:08.550 --> 00:10:16.110
of the indicators on the factors or in the case 
of uncorrelated factors they equal correlations.

00:10:16.110 --> 00:10:22.830
We use that in exploratory factor analysis.
We cannot use that in a structural regression

00:10:22.830 --> 00:10:26.880
model. A structural regression model
is an extension of a factor analysis

00:10:26.880 --> 00:10:30.810
model where we allow regression
relationships between the factors.

00:10:30.810 --> 00:10:36.090
The reason why we can't use this 
approach is that the variation of  

00:10:36.090 --> 00:10:41.370
an endogenous variable - so a variable that 
depends on other variables - is determined by

00:10:41.370 --> 00:10:47.070
those other variables. So we can't say the 
variable's variance is one if that variance  

00:10:47.070 --> 00:10:51.300
depends on other things in the model.
But that's beyond this video.

00:10:51.300 --> 00:11:00.570
Another very common approach is that
we fix the first

00:11:00.570 --> 00:11:05.610
indicator's loading to be one. And this
is the default scale setting approach in

00:11:05.610 --> 00:11:10.290
most structural regression modelling or 
confirmatory factor analysis software.
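
Fixing the first loading to one is just a re-expression of the same solution. Here is a minimal sketch with hypothetical numbers: any one-factor solution can be rescaled so that the marker loading is exactly 1, with the factor variance absorbing the change, and the model-implied covariances are untouched.

```python
import numpy as np

# A hypothetical one-factor solution in some arbitrary scaling
loadings = np.array([0.5, 0.45, 0.4])
psi = 4.0                                  # factor variance in that scaling

# Re-express the same solution with the first (marker) loading fixed to 1
marker_loadings = loadings / loadings[0]   # first element becomes exactly 1
marker_psi = psi * loadings[0] ** 2        # the factor variance absorbs the rescaling

# The factor part of the implied covariance matrix is unchanged
same_fit = np.allclose(psi * np.outer(loadings, loadings),
                       marker_psi * np.outer(marker_loadings, marker_loadings))
```

Because the marker loading is fixed rather than estimated, it has no standard error and no Z-statistic, which is what the papers discussed next show.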

00:11:10.290 --> 00:11:16.830
The reason is that this can be used pretty much 
always regardless of what kind of variables  

00:11:16.830 --> 00:11:21.420
we have here as A and B and what kind of 
relationship will be specified between A  

00:11:21.420 --> 00:11:29.370
and B. The idea is that - if we
assume that classical test theory

00:11:29.370 --> 00:11:36.420
holds - so all these errors here are just 
random noise - then the variance of A is  

00:11:36.420 --> 00:11:43.950
whatever is the variance of the true score of 
a1. So that's also appealing if we consider  

00:11:43.950 --> 00:11:51.660
that the only source of error is random noise - 
then the variance of factor A is the variation  

00:11:51.660 --> 00:11:57.420
of a1 - or what the variance of a1 would be if it
wasn't contaminated with this random noise here.

00:11:57.420 --> 00:12:05.520
So that's one more reason why
this is appealing. It allows us to consider

00:12:05.520 --> 00:12:14.730
the scale of these indicators free of error variance,
assuming classical test theory holds for the data.

00:12:14.730 --> 00:12:21.150
And this is such a common approach 
that there's a rule of thumb that I  

00:12:21.150 --> 00:12:24.900
present: always use the first
indicator to fix the scale.

00:12:24.900 --> 00:12:32.670
We can see that the papers - that we have used 
as examples in these videos - are using this  

00:12:32.670 --> 00:12:39.840
approach. Mesquita and Lazzarini - you can see all 
loadings of first indicators are ones. So they set  

00:12:39.840 --> 00:12:46.830
the scale of the latent variable by fixing this 
loading to one and then they have the Z-statistic  

00:12:46.830 --> 00:12:54.060
here, and you can see that the indicators - the
first indicators - don't have a Z-statistic.

00:12:54.060 --> 00:12:59.910
The reason is that they are not estimated from 
the data - instead a researcher says that these  

00:12:59.910 --> 00:13:05.730
are ones. They are not estimated, and if something
is not estimated it doesn't vary from sample

00:13:05.730 --> 00:13:12.840
to sample. So it doesn't have a standard error,
and we can't calculate the Z-statistic for it.

00:13:12.840 --> 00:13:20.760
We can see the same in Yli-Renko's paper. In
Yli-Renko's paper the first loading is not

00:13:20.760 --> 00:13:27.000
one, but it doesn't have a Z-statistic
and it doesn't have

00:13:27.000 --> 00:13:33.090
a standard error, so that's an indication that they
actually fixed the first loading to one to

00:13:33.090 --> 00:13:39.750
identify - or set the scale of - the latent variables. If
you want to have standardized factor loadings -

00:13:39.750 --> 00:13:47.160
so if you want to have loadings that are
expressed in the scale of the exploratory

00:13:47.160 --> 00:13:53.880
analysis where the factor variances are ones 
then you can rescale the confirmatory factor  

00:13:53.880 --> 00:14:00.420
analysis results afterwards. Your software 
will produce that for you if you check the  

00:14:00.420 --> 00:14:05.520
standardized estimates option there. So 
these are standardized estimates but the  

00:14:05.520 --> 00:14:10.320
scaling has been done after estimation.
So you first estimate an unstandardized

00:14:10.320 --> 00:14:18.690
confirmatory factor analysis where each factor 
is scaled by fixing the first indicator - then  

00:14:18.690 --> 00:14:25.320
you scale the resulting solution. That's the same 
approach that you use for standardized regression  

00:14:25.320 --> 00:14:30.300
coefficients. You first estimate the regression,
then you scale the parameter estimates later.
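
The after-the-fact standardization can be sketched for a single indicator. The numbers are hypothetical: the marker loading is fixed at 1, and the standardized loading is the unstandardized one rescaled by the factor's standard deviation and the indicator's standard deviation.

```python
import math

# Hypothetical unstandardized estimates for one indicator
marker_loading = 1.0   # fixed, not estimated
psi = 0.81             # estimated factor variance
theta = 0.19           # estimated error variance

# Model-implied variance of the indicator
indicator_var = marker_loading ** 2 * psi + theta

# Completely standardized loading: rescale by factor SD and indicator SD
std_loading = marker_loading * math.sqrt(psi) / math.sqrt(indicator_var)
```

This is the same logic software applies when you tick the "standardized estimates" option: estimation happens in the marker scale, and the rescaling is purely post hoc.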

00:14:30.300 --> 00:14:36.270
So the summary of identification of 
confirmatory factor analysis models.  

00:14:36.270 --> 00:14:44.730
A model is identified if every latent variable
has a scale, if the degrees of freedom is

00:14:44.730 --> 00:14:50.940
non-negative, and if every part
of the model is identified.

00:14:50.940 --> 00:14:57.870
In confirmatory factor analysis - after we have
established that every latent variable, every factor, has

00:14:57.870 --> 00:15:05.130
a scale - then all factors with three indicators
are always identified. So with three indicators - if you

00:15:05.130 --> 00:15:10.710
have three variables - you can always run a factor
analysis, no matter what. Then if you have two

00:15:10.710 --> 00:15:18.270
indicators - then we can either fix both to be
equally reliable, so we constrain the factor

00:15:18.270 --> 00:15:27.690
loadings to be equal - or we can embed this factor
in a larger system. With just two variables alone

00:15:27.690 --> 00:15:35.040
we can't estimate a factor model unless we fix 
these factor loadings to be the same. If we embed  

00:15:35.040 --> 00:15:42.540
this factor - the two-indicator factor -
into a larger factor analysis, then we can

00:15:42.540 --> 00:15:47.940
estimate it, because we can use information from
other indicators to estimate these factor loadings.

00:15:47.940 --> 00:15:52.140
And then the single indicator rule - if
we have a factor with just a single

00:15:52.140 --> 00:15:56.280
indicator then we cannot estimate the 
reliability of the indicator because  

00:15:56.280 --> 00:16:00.150
you cannot estimate reliability based 
on just one measure. That's the idea.

00:16:00.150 --> 00:16:07.770
We have to assume its reliability, and
typically we do that by constraining the error

00:16:07.770 --> 00:16:14.370
variance to be zero. So we say that this factor A,
or construct A, is measured without any error, since we

00:16:14.370 --> 00:16:19.470
can't estimate it. Of course we could constrain
the error variance to be something else. If we

00:16:19.470 --> 00:16:26.610
know that the indicator has typically been shown to
be eighty percent reliable - then we can fix this error

00:16:26.610 --> 00:16:33.060
variance here to be 20 percent of the observed
variance of the indicator, but that's rarely done.
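
The single-indicator arithmetic can be sketched with hypothetical numbers. Under classical test theory, a reliability of 0.80 means 80 percent of the observed variance is true-score variance and the remaining 20 percent is error, so the error variance is fixed rather than estimated.

```python
# Hypothetical single-indicator setup
observed_var = 2.5   # observed variance of the indicator
reliability = 0.80   # assumed, e.g. from prior research; cannot be estimated here

# 80% reliability: 20% of the observed variance is error
error_var = (1 - reliability) * observed_var   # fixed in the model, not estimated
true_var = reliability * observed_var          # variance attributed to the factor
```

Setting `reliability = 1.0` recovers the more common choice of fixing the error variance to zero.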

00:16:33.060 --> 00:16:37.290
So identification is a requirement for estimation.  

00:16:37.290 --> 00:16:41.040
If our model is not identified it 
cannot be meaningfully estimated.

00:16:41.040 --> 00:16:45.870
Identification basically means that you
have enough information to estimate

00:16:45.870 --> 00:16:51.360
the model. If we have one correlation,
we can't estimate two different things

00:16:51.360 --> 00:16:56.460
from one correlation. You need at 
least one unit of information for  

00:16:56.460 --> 00:17:01.170
everything that you estimate - ideally you
have more information, so there is redundancy.

00:17:01.170 --> 00:17:07.620
So we need to have a scale for all latent
variables, and the degrees of freedom must be

00:17:07.620 --> 00:17:14.400
non-negative. Ideally it is positive, and the larger
it is, the better our model tests are.