WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:04.590
Confirmatory factor analysis differs
from exploratory factor analysis in
00:00:04.590 --> 00:00:09.210
that in a confirmatory factor analysis
the researcher specifies the factors for
00:00:09.210 --> 00:00:14.370
the data instead of having the computer
discover what are the underlying factors.
00:00:14.370 --> 00:00:20.820
So confirmatory factor analysis requires that you
specify the expected result and then the computer will
00:00:20.820 --> 00:00:25.500
tell you if that result fits your data and
will estimate the factor loadings for you.
00:00:25.500 --> 00:00:29.970
This is a more flexible approach to factor
analysis than an exploratory analysis.
00:00:29.970 --> 00:00:35.580
In this video I will demonstrate on a
conceptual level what confirmatory factor
00:00:35.580 --> 00:00:39.780
analysis does and how it can be applied
to several different kinds of scenarios.
00:00:39.780 --> 00:00:50.190
Our data has six indicators. We have indicators a1
through a3 that are supposed to measure construct
00:00:50.190 --> 00:00:54.540
A so we can see that there is variation
in these indicators due to construct A
00:00:54.540 --> 00:00:59.580
and there is also some random noise - the E
here - that's the unreliability of these
00:00:59.580 --> 00:01:05.040
indicators and then we have some variance
components that are reliable. So for example
00:01:05.040 --> 00:01:11.850
if we measure a3 multiple times there is
a specific part of the a3 that is reliable
00:01:11.850 --> 00:01:18.750
but it is specific to a3. So for example if we
ask whether a company is innovative or not, that
00:01:18.750 --> 00:01:24.930
measures innovativeness but it can also measure
something else. So the unreliability is not
00:01:24.930 --> 00:01:28.230
the only source of measurement error
but there's also some item uniqueness.
00:01:28.230 --> 00:01:34.680
And the idea of a confirmatory factor
analysis is that we specify the factor model
00:01:34.680 --> 00:01:40.110
for the data ourselves. So for example
we would say that because these three
00:01:40.110 --> 00:01:46.680
indicators of a1 through a3 are supposed
to measure construct A - we assign them
00:01:46.680 --> 00:01:53.730
to factor A and then we assign these to factor
B and then each indicator gets an error term.
00:01:53.730 --> 00:02:00.480
Then the factor analysis takes the variance
of those indicators apart into variance that
00:02:00.480 --> 00:02:06.120
can be attributed to the factors and variance
that can be attributed to error terms. Like so.
00:02:06.120 --> 00:02:12.330
So now we have a factor solution here. All the
variation that is due to concept A
00:02:12.330 --> 00:02:18.060
goes to factor A, all the variation that is due to
concept B goes to factor B and all the
00:02:18.060 --> 00:02:23.880
item uniqueness and unreliability goes to the
error terms that are assumed to be uncorrelated.
00:02:23.880 --> 00:02:30.240
So these are uncorrelated distinct sources of
variation for each indicator and then we have
00:02:30.240 --> 00:02:32.790
the two common factors. So that's the ideal case.
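As a minimal sketch of the ideal case just described, the covariance matrix implied by a two-factor model can be written as loadings times factor covariance times loadings transposed, plus a diagonal matrix of error variances. All numbers below are hypothetical and chosen only for illustration; they are not taken from the lecture.

```python
import numpy as np

# Hypothetical loadings, for illustration only (rows: a1 a2 a3 b1 b2 b3).
loadings = np.array([
    [0.8, 0.0],  # a1 loads on factor A only
    [0.7, 0.0],  # a2
    [0.6, 0.0],  # a3
    [0.0, 0.8],  # b1 loads on factor B only
    [0.0, 0.7],  # b2
    [0.0, 0.6],  # b3
])

# Unit factor variances and an assumed factor correlation of 0.3.
factor_cov = np.array([[1.0, 0.3],
                       [0.3, 1.0]])

# Error variances chosen so that every indicator has total variance 1.
common_var = np.sum((loadings @ factor_cov) * loadings, axis=1)
error_cov = np.diag(1.0 - common_var)  # diagonal: errors are uncorrelated

# Model-implied covariance matrix of the six indicators.
implied_cov = loadings @ factor_cov @ loadings.T + error_cov

print(np.round(implied_cov, 3))
```

In this sketch the covariance between a3 and b1 comes only from the factor correlation (0.6 times 0.3 times 0.8), because the error matrix is diagonal.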
00:02:32.790 --> 00:02:38.820
Sometimes your data are not as
great as you would like and
00:02:38.820 --> 00:02:44.040
confirmatory factor analysis also allows you to
model problems in your measurement.
00:02:44.040 --> 00:02:51.750
So for example we have this kind of scenario.
There is variation again due to construct A,
00:02:51.750 --> 00:02:59.190
variation due to construct B and then we have
unreliability - the black circles here - and
00:02:59.190 --> 00:03:06.210
we have unique aspects of each indicator but
there is also some variation in a3 and
00:03:06.210 --> 00:03:12.570
b1 that correlates. The variance components
are here. These letters have no particular
00:03:12.570 --> 00:03:16.920
meaning by the way. They're just letters
to distinguish the different circles.
00:03:16.920 --> 00:03:26.910
So a3 and b1 correlate for some other reason
than measuring A and measuring B that could
00:03:26.910 --> 00:03:34.740
possibly be correlated. If we fit a confirmatory
factor analysis model then this variation that
00:03:34.740 --> 00:03:41.670
is shared by a3 and b1 actually goes to the
factors. The reason why it goes to the factors is
00:03:41.670 --> 00:03:48.840
that these error terms are constrained to
be uncorrelated. And what happens now is
00:03:48.840 --> 00:03:55.320
that our factors - which are supposed to represent
construct A and construct B - are contaminated by
00:03:55.320 --> 00:04:02.790
this secondary source of variation that
is present in a3 and b1 and as a consequence
00:04:02.790 --> 00:04:08.700
the correlation between factors A and B will be
overestimated and your results will be biased.
00:04:08.700 --> 00:04:16.230
And this is also the case in exploratory
factor analysis. So if we have
00:04:16.230 --> 00:04:22.800
this kind of minor factor
that influences a3 and b1 and we
00:04:22.800 --> 00:04:28.620
only get two factors then the factor correlation
between those two factors will be inflated. If
00:04:28.620 --> 00:04:34.680
we were to run an exploratory analysis then the
exploratory analysis could identify that there
00:04:34.680 --> 00:04:40.740
is a third factor that loads on a3 and
b1 but because it's just two indicators it's
00:04:40.740 --> 00:04:45.720
also possible that the exploratory analysis
wouldn't identify that factor for us.
00:04:45.720 --> 00:04:52.230
So what can we do in this kind of situation?
Confirmatory analysis also allows us to model
00:04:52.230 --> 00:04:59.550
correlated errors. So instead of specifying
that these error terms of a3 and b1 are
00:04:59.550 --> 00:05:05.640
uncorrelated we can say that it's possible
that a3 and b1 correlate for some other
00:05:05.640 --> 00:05:10.770
reason. So we relax the constraint. We
specify that these two can be correlated
00:05:10.770 --> 00:05:19.200
and then the variation in a3 and b1 that
is shared between these indicators but
00:05:19.200 --> 00:05:25.710
not with others - so it's not part of
the factors - then gets to escape to these
00:05:25.710 --> 00:05:31.290
error terms and we again get a clean
estimate of the correlation between factors A and B.
00:05:31.290 --> 00:05:39.090
But this is something that many people do. So your
statistical software will tell you that the model
00:05:39.090 --> 00:05:44.280
doesn't fit the data perfectly and it'll also
tell you that you could free some correlations
00:05:44.280 --> 00:05:50.190
to make the model fit better but that's a bit
dangerous unless you know what you're doing.
00:05:50.190 --> 00:05:58.200
You should only add these kinds of correlated errors
if you have a good theoretical reason to do so. So
00:05:58.200 --> 00:06:02.670
the fact that your statistical software tells you
that you could do something to increase
00:06:02.670 --> 00:06:11.280
the model fit is not a reason to do it.
It's an indication that you could do something and
00:06:11.280 --> 00:06:17.670
you should consider it. It's not a definite
guideline that you should actually do that.
00:06:17.670 --> 00:06:25.260
So under which scenario is
it a good idea to allow the error terms of two
00:06:25.260 --> 00:06:31.080
indicators to correlate? For example if our
indicators would look like this. So we would
00:06:31.080 --> 00:06:37.035
have indicators about innovativeness. So the A
factor here is innovativeness and the B factor
00:06:37.035 --> 00:06:42.630
is productivity. So we would have questions about
innovativeness and questions about productivity.
00:06:42.630 --> 00:06:51.390
Then we realize that a3 is
"our personnel is innovative" and b1 is
00:06:51.390 --> 00:06:56.040
"our personnel is productive". So both
of these actually have this
00:06:56.040 --> 00:07:01.020
personnel dimension as well. So they
don't measure only innovativeness and
00:07:01.020 --> 00:07:06.600
productivity they also measure how high
quality the personnel in the company are.
00:07:06.600 --> 00:07:11.070
So then we realize that okay so there
is a secondary dimension that these two
00:07:11.070 --> 00:07:16.230
indicators measure and then we can add the
error correlation here. But you also have
00:07:16.230 --> 00:07:22.470
to justify it. It's not enough to say
that the statistical software tells us that the model
00:07:22.470 --> 00:07:27.720
fits better if we do something. You have
to justify it also in non-statistical terms.
00:07:27.720 --> 00:07:32.730
This is the same thing as with outliers
- you don't delete an observation because it
00:07:32.730 --> 00:07:38.100
is different, you have to explain why it's
different in non-statistical terms. The same
00:07:38.100 --> 00:07:45.090
thing when you eliminate indicators from a scale.
So your statistical software will tell you that
00:07:45.090 --> 00:07:51.990
sometimes eliminating an item from a scale will
make Cronbach's alpha go up but that's not
00:07:51.990 --> 00:07:57.660
a reason to eliminate an item. You should also
look at non-statistical criteria. So what does
00:07:57.660 --> 00:08:03.990
the item look like is there a good reason why
we think it's less reliable? Because these kinds
00:08:03.990 --> 00:08:10.530
of suggestions by your software could also
be just a random correlation between two random
00:08:10.530 --> 00:08:16.470
elements - a random correlation between these E's -
and then you would be misspecifying the model.
00:08:16.470 --> 00:08:23.640
Another way - perhaps a bit better way -
to accomplish the same is to specify this
00:08:23.640 --> 00:08:30.060
secondary factor. So instead of saying that
these two errors are correlated we could say
00:08:30.060 --> 00:08:36.690
that these indicators a3 and b1 actually are
also measuring something else. So we add this
00:08:36.690 --> 00:08:44.160
secondary factor here and this is a somewhat more
appealing approach because then you
00:08:44.160 --> 00:08:52.020
have to explicitly interpret what this
factor means. It's a lot easier to free
00:08:52.020 --> 00:08:57.360
correlations without explaining
what the interpretation of
00:08:57.360 --> 00:09:02.670
the correlation between these error terms is.
It's a lot easier to do that without an explanation
00:09:02.670 --> 00:09:08.970
than adding a factor. So if you add a factor then
your reviewers will ask you to explain it and
00:09:08.970 --> 00:09:13.710
you always should so it's a good idea to have
the factor instead of having the correlation.
00:09:13.710 --> 00:09:21.780
Mathematically both of these accomplish the
exact same thing. They allow the unique aspect
00:09:21.780 --> 00:09:26.700
of a3 and b1 that is correlated
to escape from the error terms.
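The equivalence claimed here can be checked numerically. The sketch below uses hypothetical loadings (not from the lecture) and assumes the secondary factor is uncorrelated with A and B and has unit variance. It builds the implied covariance matrix once with a secondary factor on a3 and b1, and once with a correlated error between a3 and b1, and compares the two.

```python
import numpy as np

def implied_cov(loadings, factor_cov, error_cov):
    """Model-implied covariance: loadings * factor_cov * loadings' + error_cov."""
    return loadings @ factor_cov @ loadings.T + error_cov

# Hypothetical two-factor part (rows: a1 a2 a3 b1 b2 b3).
base_loadings = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.6],
])
base_factor_cov = np.array([[1.0, 0.3],
                            [0.3, 1.0]])
unique_var = np.full(6, 0.3)  # assumed unique variance of each indicator

# Model 1: a secondary factor that only a3 and b1 load on.
minor = np.zeros(6)
minor[2], minor[3] = 0.4, 0.5            # assumed loadings of a3 and b1
loadings_minor = np.column_stack([base_loadings, minor])
factor_cov_minor = np.eye(3)
factor_cov_minor[:2, :2] = base_factor_cov   # minor factor uncorrelated with A, B
sigma_minor = implied_cov(loadings_minor, factor_cov_minor, np.diag(unique_var))

# Model 2: no secondary factor, but the errors of a3 and b1 covary.
error_cov = np.diag(unique_var + minor**2)      # minor-factor variance moves into errors
error_cov[2, 3] = error_cov[3, 2] = minor[2] * minor[3]
sigma_corr = implied_cov(base_loadings, base_factor_cov, error_cov)

print(np.allclose(sigma_minor, sigma_corr))
```

With these values both specifications imply the same covariance matrix, so the choice between them rests on interpretation rather than fit.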
00:09:26.700 --> 00:09:35.100
This example of adding the minor factors can be
00:09:35.100 --> 00:09:40.980
extended to another scenario as well. So
these are just the same indicators again.
00:09:40.980 --> 00:09:51.720
We can have this kind of scenario. So what's
the scenario here? Indicators a1
00:09:51.720 --> 00:09:58.080
through a3 measure A. Indicators b1 to b3
measure B. Then there's unreliability - the
00:09:58.080 --> 00:10:02.460
E's - and then there is some variation
that is shared by all the indicators.
00:10:02.460 --> 00:10:11.130
That variation could be for example variation due
to the measurement method. So this is a scenario
00:10:11.130 --> 00:10:16.200
where you would have common method variance. So
the R would be here - the variation due to the
00:10:16.200 --> 00:10:23.100
method or the common method variance - and then
if we estimate the factor model with a1 a2 a3
00:10:23.100 --> 00:10:29.130
loading on A and b1 b2 b3 loading on B
- then all variation due to the method escapes
00:10:29.130 --> 00:10:34.680
to this B factor and A factor and the factor
correlation will be overestimated greatly.
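A small numeric sketch of this inflation, under simplifying assumptions that are not from the lecture: every indicator has the same construct loading and the same method loading, and the method factor is uncorrelated with A and B. In that special case a two-factor model without the method factor reproduces the covariance matrix exactly, but only by absorbing the method variance into the loadings and the factor correlation.

```python
import numpy as np

# Hypothetical population values, for illustration only.
lam, phi, method = 0.7, 0.3, 0.4  # construct loading, true A-B correlation, method loading
unique_var = 0.35                 # error variance of every indicator

# True model: factors A and B plus a method factor M uncorrelated with both.
true_loadings = np.array([[lam, 0.0, method]] * 3 + [[0.0, lam, method]] * 3)
true_factor_cov = np.array([[1.0, phi, 0.0],
                            [phi, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
sigma = true_loadings @ true_factor_cov @ true_loadings.T + unique_var * np.eye(6)

# Two-factor model that ignores the method factor.
within = lam**2 + method**2       # covariance of two indicators of the same construct
cross = lam**2 * phi + method**2  # covariance of an A indicator with a B indicator
fitted_loading = np.sqrt(within)
phi_hat = cross / within          # factor correlation the two-factor model needs

two_loadings = np.array([[fitted_loading, 0.0]] * 3 + [[0.0, fitted_loading]] * 3)
two_factor_cov = np.array([[1.0, phi_hat],
                           [phi_hat, 1.0]])
sigma_two = two_loadings @ two_factor_cov @ two_loadings.T + unique_var * np.eye(6)

print(np.allclose(sigma, sigma_two))  # the misspecified model still fits exactly
print(phi, round(phi_hat, 3))         # true versus inflated factor correlation
```

Here the true factor correlation is 0.3 but the two-factor model implies roughly 0.47, and nothing in the fit reveals the problem.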
00:10:34.680 --> 00:10:40.290
So in this kind of scenario it is
possible to also specify a secondary
00:10:40.290 --> 00:10:47.370
factor. So we can specify this method
factor here and the idea is that all
00:10:47.370 --> 00:10:51.810
the indicators load on the factors
they're supposed to measure - the
00:10:51.810 --> 00:10:57.540
factors representing the constructs - and on a
factor representing the measurement process.
00:10:57.540 --> 00:11:07.050
So this looks really good. It looks too good to be
true. This is not a panacea for method variance
00:11:07.050 --> 00:11:12.730
problems. This kind of model is
problematic to estimate. The reason for
00:11:12.730 --> 00:11:22.540
that is that a high correlation between A
and B is nearly indistinguishable from a1
00:11:22.540 --> 00:11:29.230
a2 a3 b1 b2 b3 just being caused by one
factor. So they are empirically nearly
00:11:29.230 --> 00:11:34.210
impossible to distinguish so this kind
of model is very unstable to estimate.
00:11:34.210 --> 00:11:41.290
In practice these models have been shown
to be problematic even with simulated
00:11:41.290 --> 00:11:48.040
datasets but there's one way that this kind
of model can work and it's if you add these
00:11:48.040 --> 00:11:52.390
marker indicators. So sometimes you see
in published papers that they use marker
00:11:52.390 --> 00:11:58.300
indicators. The idea of marker indicators is
that you have indicators that are unrelated
00:11:58.300 --> 00:12:05.410
to the factors that you're modeling so
a1 and b1 are unrelated to these m1 m2.
00:12:05.410 --> 00:12:12.880
For example if you use innovativeness and
productivity and then you have questions
00:12:12.880 --> 00:12:17.110
on a one to seven scale - you could
have a marker indicator of whether
00:12:17.110 --> 00:12:20.650
the person likes jazz music or not.
I've actually seen that being used.
00:12:20.650 --> 00:12:27.970
The idea is that how much you like jazz
music is completely unrelated to the
00:12:27.970 --> 00:12:34.690
innovativeness and productivity of your company
but if the jazz music indicator correlates with
00:12:34.690 --> 00:12:41.650
these indicators then we can assume that that
correlation is purely due to the measurement
00:12:41.650 --> 00:12:48.850
method because jazz music liking and innovation
really are two completely different things.