WEBVTT
Kind: captions
Language: en

00:00:00.030 --> 00:00:04.590
Confirmatory factor analysis differs 
from exploratory factor analysis in  

00:00:04.590 --> 00:00:09.210
that in a confirmatory factor analysis 
the researcher specifies the factors for  

00:00:09.210 --> 00:00:14.370
the data instead of having the computer 
discover what the underlying factors are.

00:00:14.370 --> 00:00:20.820
So confirmatory factor analysis requires that you 
specify the expected result and then the computer will  

00:00:20.820 --> 00:00:25.500
tell you if that result fits your data and 
will estimate the factor loadings for you.

00:00:25.500 --> 00:00:29.970
This is a more flexible approach to factor 
analysis than an exploratory analysis.

00:00:29.970 --> 00:00:35.580
In this video I will show you on a 
conceptual level what confirmatory factor  

00:00:35.580 --> 00:00:39.780
analysis does and how it can be applied 
to different kinds of scenarios.

00:00:39.780 --> 00:00:50.190
Our data has six indicators. We have indicators a1 
through a3 that are supposed to measure construct  

00:00:50.190 --> 00:00:54.540
A, so we can see that there is variation 
in these indicators due to construct A  

00:00:54.540 --> 00:00:59.580
and there is also some random noise - the E 
here - that's the unreliability of these  

00:00:59.580 --> 00:01:05.040
indicators, and then we have some variance 
components that are reliable. So for example  

00:01:05.040 --> 00:01:11.850
if we measure a3 multiple times there is 
a specific part of a3 that is reliable  

00:01:11.850 --> 00:01:18.750
but it is specific to a3. So for example if we 
ask whether a company is innovative or not, that  

00:01:18.750 --> 00:01:24.930
measures innovativeness, but it can also measure 
something else. So the unreliability is not  

00:01:24.930 --> 00:01:28.230
the only source of measurement error 
but there's also some item uniqueness.
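
NOTE
A minimal sketch of these variance components, assuming Python with numpy and
pandas; the loadings and the A-B correlation below are made-up illustration
values, not taken from the video.
  import numpy as np
  import pandas as pd
  rng = np.random.default_rng(1)
  n = 1000
  A = rng.normal(size=n)                 # construct A
  B = 0.3 * A + rng.normal(size=n)       # construct B, somewhat correlated with A
  def indicator(construct):
      unique = rng.normal(size=n)        # reliable but item-specific variance
      noise = rng.normal(size=n)         # random noise, the unreliability "E"
      return 0.7 * construct + 0.4 * unique + 0.5 * noise
  data = pd.DataFrame({name: indicator(A) for name in ["a1", "a2", "a3"]}
                      | {name: indicator(B) for name in ["b1", "b2", "b3"]})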

00:01:28.230 --> 00:01:34.680
And the idea of a confirmatory factor 
analysis is that we specify a factor model  

00:01:34.680 --> 00:01:40.110
for these data ourselves. So for example 
we would say that because these three  

00:01:40.110 --> 00:01:46.680
indicators a1 through a3 are supposed 
to measure construct A - we assign them  

00:01:46.680 --> 00:01:53.730
to factor A and then we assign these to factor 
B, and then each indicator gets an error term.

00:01:53.730 --> 00:02:00.480
Then the factor analysis takes the variance 
of those indicators apart into variance that  

00:02:00.480 --> 00:02:06.120
can be attributed to the factors and variance 
that can be attributed to the error terms. Like so.

00:02:06.120 --> 00:02:12.330
So now we have a factor solution here. All 
variation that is due to the concept A  

00:02:12.330 --> 00:02:18.060
goes to factor A, all variation that is due to 
the concept B goes to factor B, and all the  

00:02:18.060 --> 00:02:23.880
item uniqueness and unreliability go to the 
error terms, which are assumed to be uncorrelated.

00:02:23.880 --> 00:02:30.240
So these are uncorrelated, distinct sources of 
variation for each indicator, and then we have  

00:02:30.240 --> 00:02:32.790
the two common factors. So that's the ideal case.
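
NOTE
A minimal sketch of this two-factor specification, assuming the Python package
semopy (which accepts lavaan-style model syntax) and a pandas DataFrame named
data with columns a1-a3 and b1-b3, such as the simulated one above.
  import semopy
  # a1-a3 are assigned to factor A and b1-b3 to factor B; each indicator gets
  # its own error term, the error terms are uncorrelated, and the two factors
  # are typically allowed to covary by default.
  desc = """
  A =~ a1 + a2 + a3
  B =~ b1 + b2 + b3
  """
  model = semopy.Model(desc)
  model.fit(data)
  print(model.inspect())  # loadings, error variances, and the A-B covariance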

00:02:32.790 --> 00:02:38.820
Sometimes your data are not 
as great as you would like and  

00:02:38.820 --> 00:02:44.040
confirmatory factor analysis allows you to 
also model problems in your measurement.

00:02:44.040 --> 00:02:51.750
So for example if we have this kind of scenario. 
There is variation again due to construct A,  

00:02:51.750 --> 00:02:59.190
variation due to construct B, and then we have 
unreliability - the black circles here - and  

00:02:59.190 --> 00:03:06.210
we have unique aspects of each indicator, but 
there is also some variation in a3 and  

00:03:06.210 --> 00:03:12.570
b1 that correlates. The variance components 
are here. These letters have no particular  

00:03:12.570 --> 00:03:16.920
meaning by the way. They're just letters 
to indicate that these are different circles.

00:03:16.920 --> 00:03:26.910
So a3 and b1 correlate for some reason other 
than measuring A and measuring B, which could  

00:03:26.910 --> 00:03:34.740
possibly be correlated. If we fit a confirmatory 
factor analysis model then this variation that  

00:03:34.740 --> 00:03:41.670
is shared by a3 and b1 actually goes to the 
factors. The reason why it goes to the factors is  

00:03:41.670 --> 00:03:48.840
that these error terms are constrained to 
be uncorrelated. And what happens now is  

00:03:48.840 --> 00:03:55.320
that our factors - which are supposed to represent 
construct A and construct B - are contaminated by  

00:03:55.320 --> 00:04:02.790
this secondary source of variation that 
is present in a3 and b1, and as a consequence  

00:04:02.790 --> 00:04:08.700
the correlation between factors A and B will be 
overestimated and your results will be biased.

00:04:08.700 --> 00:04:16.230
And this is also the case in exploratory 
factor analysis. So if we have this kind of  

00:04:16.230 --> 00:04:22.800
minor factor that influences a3 and b1, and we  

00:04:22.800 --> 00:04:28.620
only get two factors then the factor correlation 
between those two factors will be inflated. If  

00:04:28.620 --> 00:04:34.680
we were to run exploratory analysis then the 
exploratory analysis could identify that there  

00:04:34.680 --> 00:04:40.740
is a third factor that loads on a3 and 
b1, but because it's just two indicators it's  

00:04:40.740 --> 00:04:45.720
also possible that the exploratory analysis 
wouldn't identify that factor for us.

00:04:45.720 --> 00:04:52.230
So what can we do in this kind of situation? 
Confirmatory analysis allows us to also model  

00:04:52.230 --> 00:04:59.550
correlated error. So instead of specifying 
that these error terms of a3 and b1 are  

00:04:59.550 --> 00:05:05.640
uncorrelated we can say that it's possible 
that a3 and b1 correlate for some other  

00:05:05.640 --> 00:05:10.770
reason. So we relax the constraint. We 
specify that these two can be correlated  

00:05:10.770 --> 00:05:19.200
and then the variation in a3 and b1 that 
is shared between these indicators but  

00:05:19.200 --> 00:05:25.710
not with others - so it's not part of 
the factors - then gets to escape into these  

00:05:25.710 --> 00:05:31.290
error terms, and then we again get a clean 
estimate of the correlation between factors A and B.
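
NOTE
A sketch of relaxing that constraint in the same hypothetical semopy setup,
assuming ~~ between two indicators specifies a residual covariance (as in
lavaan): the only change is one extra line for the a3 and b1 error terms.
  import semopy
  desc = """
  A =~ a1 + a2 + a3
  B =~ b1 + b2 + b3
  a3 ~~ b1
  """
  # The freed residual covariance lets the secondary variance shared by a3 and
  # b1 stay out of the factors, so the A-B correlation is estimated more cleanly.
  model = semopy.Model(desc)
  model.fit(data)
  print(model.inspect())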

00:05:31.290 --> 00:05:39.090
But this is something that many people do. So your 
statistical software will tell you that the model  

00:05:39.090 --> 00:05:44.280
doesn't fit the data perfectly and it'll also 
tell you that you could free some correlations  

00:05:44.280 --> 00:05:50.190
to make the model fit better but that's a bit 
dangerous unless you know what you're doing.

00:05:50.190 --> 00:05:58.200
You should only add this kind of correlated error 
if you have a good theoretical reason to do so. So  

00:05:58.200 --> 00:06:02.670
the fact that your statistical software tells you 
that you could do something to increase  

00:06:02.670 --> 00:06:11.280
the model fit is not in itself a reason to do it. 
It's an indication that you could do something and  

00:06:11.280 --> 00:06:17.670
you should consider it. It's not a definite 
guideline that you should actually do that.

00:06:17.670 --> 00:06:25.260
So under which scenario, then, is 
it a good idea to allow the error terms of two  

00:06:25.260 --> 00:06:31.080
indicators to correlate? For example if our 
indicators would look like that. So we would  

00:06:31.080 --> 00:06:37.035
have indicators about innovativeness. So here 
factor A is innovativeness and factor B  

00:06:37.035 --> 00:06:42.630
is productivity. So we would have questions about 
innovativeness and questions about productivity.

00:06:42.630 --> 00:06:51.390
Then we realize that okay, so a3 is 
'our personnel is innovative' and b1 is  

00:06:51.390 --> 00:06:56.040
'our personnel is productive'. So both 
of these actually have this  

00:06:56.040 --> 00:07:01.020
personnel dimension as well. So they 
don't measure only innovativeness and  

00:07:01.020 --> 00:07:06.600
productivity; they also measure how high 
quality the personnel in the company are.

00:07:06.600 --> 00:07:11.070
So then we realize that okay so there 
is a secondary dimension that these two  

00:07:11.070 --> 00:07:16.230
indicators measure and then we can add the 
error correlation here. But you also have  

00:07:16.230 --> 00:07:22.470
to justify it. So it's not enough to say 
that the statistical software tells us that the model  

00:07:22.470 --> 00:07:27.720
fits better if we do something; you have 
to justify it also in non-statistical terms.

00:07:27.720 --> 00:07:32.730
This is the same thing as with outliers 
- you don't delete an observation because it  

00:07:32.730 --> 00:07:38.100
is different; you have to explain why it's 
different in non-statistical terms. The same  

00:07:38.100 --> 00:07:45.090
thing when you eliminate indicators from a scale. 
So your statistical software will tell you that  

00:07:45.090 --> 00:07:51.990
sometimes eliminating an item from a scale will 
make Cronbach's alpha go up, but that's not  

00:07:51.990 --> 00:07:57.660
a reason to eliminate an item. You should also 
look at non-statistical criteria. So what does  

00:07:57.660 --> 00:08:03.990
the item look like? Is there a good reason why 
we think it's less reliable? Because this kind  

00:08:03.990 --> 00:08:10.530
of suggestion by your software could also 
be just a random correlation between two random  

00:08:10.530 --> 00:08:16.470
elements - so a random correlation between these E's 
- and then you would be misspecifying the model.

00:08:16.470 --> 00:08:23.640
Another way - perhaps a bit better way - 
to accomplish the same thing is to specify this  

00:08:23.640 --> 00:08:30.060
secondary factor. So instead of saying that 
these two errors are correlated, we could say  

00:08:30.060 --> 00:08:36.690
that these indicators a3 and b1 actually are 
also measuring something else. So we add this  

00:08:36.690 --> 00:08:44.160
secondary factor here, and this is a bit more 
appealing approach because then you  

00:08:44.160 --> 00:08:52.020
have to explicitly interpret what this 
factor means. And it's a lot easier to free  

00:08:52.020 --> 00:08:57.360
correlations without explaining what they 
actually are - what the interpretation of  

00:08:57.360 --> 00:09:02.670
the correlation between these error terms is. 
It's a lot easier to do that without an explanation  

00:09:02.670 --> 00:09:08.970
than adding a factor. So if you add a factor then 
your reviewers will ask you to explain it and  

00:09:08.970 --> 00:09:13.710
you always should so it's a good idea to have 
the factor instead of having the correlation.

00:09:13.710 --> 00:09:21.780
Mathematically both of these accomplish the 
exact same thing. They allow the unique aspect  

00:09:21.780 --> 00:09:26.700
of a3 and b1 that is correlated 
to escape from the error terms.
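
NOTE
A sketch of the minor-factor alternative in the same hypothetical setup. A
two-indicator factor usually needs constraints to be identified; the fixed
loadings (1*) and zero covariances (0*) below assume semopy supports lavaan's
value-fixing syntax, which should be checked against the version you use.
  import semopy
  desc = """
  A =~ a1 + a2 + a3
  B =~ b1 + b2 + b3
  U =~ 1*a3 + 1*b1
  U ~~ 0*A
  U ~~ 0*B
  """
  # U is the secondary (e.g. personnel) factor shared by a3 and b1; keeping it
  # uncorrelated with A and B means it carries only the a3-b1 overlap, which is
  # mathematically equivalent to the correlated-error specification above.
  model = semopy.Model(desc)
  model.fit(data)
  print(model.inspect())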

00:09:26.700 --> 00:09:35.100
This example of adding a minor factor can be  

00:09:35.100 --> 00:09:40.980
also extended to another scenario. So 
these are just the same indicators again.

00:09:40.980 --> 00:09:51.720
We can have this kind of scenario. So what's 
the scenario here? Indicators a1  

00:09:51.720 --> 00:09:58.080
through a3 measure A. Indicators b1 to b3 
measure B. Then there's unreliability - the  

00:09:58.080 --> 00:10:02.460
E's - and then there is some variation 
that is shared by all the indicators.

00:10:02.460 --> 00:10:11.130
That variation could be for example variation due 
to the measurement method. So this is a scenario  

00:10:11.130 --> 00:10:16.200
where you would have common method variance. So 
the R would be here - the variation due to the  

00:10:16.200 --> 00:10:23.100
method or the common method variance - and then 
if we estimate the factor model with a1 a2 a3  

00:10:23.100 --> 00:10:29.130
loading on A and the b's loading on B 
- then all variation due to the method escapes  

00:10:29.130 --> 00:10:34.680
to the B factor and the A factor, and the factor 
correlation will be greatly overestimated.

00:10:34.680 --> 00:10:40.290
So in this kind of scenario it is 
possible to also specify a secondary  

00:10:40.290 --> 00:10:47.370
factor. So we can specify this method 
factor here and the idea is that all  

00:10:47.370 --> 00:10:51.810
the indicators load on the factors 
they're supposed to measure - the  

00:10:51.810 --> 00:10:57.540
factors representing the constructs - and a 
factor representing the measurement process.
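
NOTE
A sketch of this method-factor specification in the same hypothetical setup:
every indicator loads on its construct and on an orthogonal method factor M
(the zero covariances use the same assumed value-fixing syntax as above).
  import semopy
  desc = """
  A =~ a1 + a2 + a3
  B =~ b1 + b2 + b3
  M =~ a1 + a2 + a3 + b1 + b2 + b3
  M ~~ 0*A
  M ~~ 0*B
  """
  # M is meant to absorb variance shared by all indicators because of the
  # measurement method; in practice M can be hard to tell apart from a single
  # common factor, so this model can be very unstable to estimate.
  model = semopy.Model(desc)
  model.fit(data)
  print(model.inspect())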

00:10:57.540 --> 00:11:07.050
So this looks really good. It looks too good to be 
true. This is not a panacea for method variance  

00:11:07.050 --> 00:11:12.730
problems. This kind of model is 
problematic to estimate. The reason for  

00:11:12.730 --> 00:11:22.540
that is that a high correlation between A 
and B is nearly indistinguishable from a1  

00:11:22.540 --> 00:11:29.230
a2 a3 b1 b2 b3 just being caused by one 
factor. So they are empirically nearly  

00:11:29.230 --> 00:11:34.210
impossible to distinguish so this kind 
of model is very unstable to estimate.

00:11:34.210 --> 00:11:41.290
In practice these models have been shown 
to be problematic even with simulated  

00:11:41.290 --> 00:11:48.040
datasets but there's one way that this kind 
of model can work and it's if you add these  

00:11:48.040 --> 00:11:52.390
marker indicators. So sometimes you see 
in published papers that they use marker  

00:11:52.390 --> 00:11:58.300
indicators. The idea of marker indicators is 
that you have indicators that are unrelated  

00:11:58.300 --> 00:12:05.410
to the factors that you're modeling, so 
a1 and b1 are unrelated to these m1 and m2.

00:12:05.410 --> 00:12:12.880
For example if you use innovativeness and 
productivity and then you have questions  

00:12:12.880 --> 00:12:17.110
on a one to seven scale - you could 
have a marker indicator of whether  

00:12:17.110 --> 00:12:20.650
the person likes jazz music or not. 
I've actually seen that being used.

00:12:20.650 --> 00:12:27.970
The idea is that how much you like jazz 
music is completely unrelated to the  

00:12:27.970 --> 00:12:34.690
innovativeness and productivity of your company 
but if the jazz music indicator correlates with  

00:12:34.690 --> 00:12:41.650
these indicators then we can assume that that 
correlation is purely due to the measurement  

00:12:41.650 --> 00:12:48.850
method because jazz music liking and innovation 
really are two completely different things.
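
NOTE
A sketch of adding marker indicators to the method factor, assuming the data
also contain two hypothetical marker columns m1 and m2 (e.g. the jazz items)
that are unrelated to the constructs and load only on the method factor M.
  import semopy
  desc = """
  A =~ a1 + a2 + a3
  B =~ b1 + b2 + b3
  M =~ m1 + m2 + a1 + a2 + a3 + b1 + b2 + b3
  M ~~ 0*A
  M ~~ 0*B
  """
  # Any correlation between the marker items and the substantive indicators is
  # attributed to M, the measurement method, because liking jazz is assumed to
  # be unrelated to innovativeness (A) and productivity (B).
  model = semopy.Model(desc)
  model.fit(data)
  print(model.inspect())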