WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:04.590 Confirmatory factor analysis differs from exploratory factor analysis in 00:00:04.590 --> 00:00:09.210 that in a confirmatory factor analysis the researcher specifies the factors for 00:00:09.210 --> 00:00:14.370 the data instead of having the computer discover what the underlying factors are. 00:00:14.370 --> 00:00:20.820 So confirmatory factor analysis requires that you specify the expected result, and then the computer will 00:00:20.820 --> 00:00:25.500 tell you if that result fits your data and will estimate the factor loadings for you. 00:00:25.500 --> 00:00:29.970 This is a more flexible approach to factor analysis than an exploratory analysis. 00:00:29.970 --> 00:00:35.580 In this video I will demonstrate on a conceptual level what confirmatory factor 00:00:35.580 --> 00:00:39.780 analysis does and how it can be applied to different kinds of scenarios. 00:00:39.780 --> 00:00:50.190 Our data has six indicators. We have indicators a1 through a3 that are supposed to measure construct 00:00:50.190 --> 00:00:54.540 A, so we can see that there is variation in these indicators due to construct A, 00:00:54.540 --> 00:00:59.580 and there is also some random noise, the E here, so that's the unreliability of these 00:00:59.580 --> 00:01:05.040 indicators, and then we have some variance components that are reliable. So for example, 00:01:05.040 --> 00:01:11.850 if we measure a3 multiple times, there is a specific part of a3 that is reliable 00:01:11.850 --> 00:01:18.750 but it is specific to a3. So for example, if we ask whether a company is innovative or not, that 00:01:18.750 --> 00:01:24.930 measures innovativeness, but it can also measure something else. So the unreliability is not 00:01:24.930 --> 00:01:28.230 the only source of measurement error; there's also some item uniqueness. 
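The three sources of variation just described (construct, reliable item-specific part, random noise) can be illustrated with a small simulation. This is only a sketch under assumed values: the weights 0.7, 0.4 and 0.6, the sample size, and the helper name `measure` are hypothetical and do not come from the video. It shows that measuring a3 twice gives a higher correlation than a3 has with a1, because the item-specific part repeats while the random noise does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large sample so the correlations sit close to their population values

construct_a = rng.normal(size=n)   # variation due to construct A
specific_a1 = rng.normal(size=n)   # reliable variation unique to item a1
specific_a3 = rng.normal(size=n)   # reliable variation unique to item a3


def measure(construct, specific):
    """One measurement occasion: construct + item-specific part + fresh random noise."""
    noise = rng.normal(size=n)     # unreliability, drawn anew on every occasion
    return 0.7 * construct + 0.4 * specific + 0.6 * noise  # assumed weights


a1 = measure(construct_a, specific_a1)
a3_first = measure(construct_a, specific_a3)
a3_again = measure(construct_a, specific_a3)   # same item measured a second time

r_retest = np.corrcoef(a3_first, a3_again)[0, 1]  # shares construct AND item-specific part
r_items = np.corrcoef(a1, a3_first)[0, 1]         # shares only the construct

print(f"a3 with repeated a3: {r_retest:.2f}")
print(f"a1 with a3:          {r_items:.2f}")
```

The gap between the two correlations is the item uniqueness: it is reliable, so it is not random noise, but it is still measurement error with respect to construct A.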
00:01:28.230 --> 00:01:34.680 And the idea of a confirmatory factor analysis is that we specify the factor model 00:01:34.680 --> 00:01:40.110 for this data ourselves. So for example, we would say that because these three 00:01:40.110 --> 00:01:46.680 indicators a1 through a3 are supposed to measure construct A, we assign them 00:01:46.680 --> 00:01:53.730 to factor A, and then we assign these to factor B, and then each indicator gets an error term. 00:01:53.730 --> 00:02:00.480 Then the factor analysis takes the variance of those indicators apart into variance that 00:02:00.480 --> 00:02:06.120 can be attributed to the factors and variance that can be attributed to the error terms. Like so. 00:02:06.120 --> 00:02:12.330 So now we have a factor solution here. All variation that is due to concept A 00:02:12.330 --> 00:02:18.060 goes to factor A, all variation that is due to concept B goes to factor B, and all the 00:02:18.060 --> 00:02:23.880 item uniqueness and unreliability goes to the error terms, which are assumed to be uncorrelated. 00:02:23.880 --> 00:02:30.240 So there are these uncorrelated, distinct sources of variation for each indicator, and then we have 00:02:30.240 --> 00:02:32.790 the two common factors. So that's the ideal case. 00:02:32.790 --> 00:02:38.820 Sometimes your data are not as great as you would like, and 00:02:38.820 --> 00:02:44.040 confirmatory factor analysis also allows you to model problems in your measurement. 00:02:44.040 --> 00:02:51.750 So for example, consider this kind of scenario. There is again variation due to construct A, 00:02:51.750 --> 00:02:59.190 variation due to construct B, and then we have unreliability - the black circles here - and 00:02:59.190 --> 00:03:06.210 we have unique aspects of each indicator, but there is also some variation in a3 and 00:03:06.210 --> 00:03:12.570 b1 that correlates. The variance components are here. 
These letters have no particular 00:03:12.570 --> 00:03:16.920 meaning, by the way. They're just letters to distinguish that these are different circles. 00:03:16.920 --> 00:03:26.910 So a3 and b1 correlate for some other reason than measuring A and measuring B, which could 00:03:26.910 --> 00:03:34.740 possibly be correlated. If we fit a confirmatory factor analysis model, then this variation that 00:03:34.740 --> 00:03:41.670 is shared by a3 and b1 actually goes to the factors. The reason why it goes to the factors is 00:03:41.670 --> 00:03:48.840 that these error terms are constrained to be uncorrelated. And what happens now is 00:03:48.840 --> 00:03:55.320 that our factors - which are supposed to represent construct A and construct B - are contaminated by 00:03:55.320 --> 00:04:02.790 this secondary source of variation that is present in a3 and b1, and as a consequence 00:04:02.790 --> 00:04:08.700 the correlation between factors A and B will be overestimated and your results will be biased. 00:04:08.700 --> 00:04:16.230 And this is also the case in exploratory factor analysis. So if we have 00:04:16.230 --> 00:04:22.800 this kind of minor factor that influences a3 and b1 and we 00:04:22.800 --> 00:04:28.620 only get two factors, then the correlation between those two factors will be inflated. If 00:04:28.620 --> 00:04:34.680 we were to run an exploratory analysis, then the exploratory analysis could identify that there 00:04:34.680 --> 00:04:40.740 is a third factor that loads on a3 and b1, but because it's just two indicators, it's 00:04:40.740 --> 00:04:45.720 also possible that the exploratory analysis wouldn't identify that factor for us. 00:04:45.720 --> 00:04:52.230 So what can we do in this kind of situation? Confirmatory analysis also allows us to model 00:04:52.230 --> 00:04:59.550 correlated errors. 
So instead of specifying that these error terms of a3 and b1 are 00:04:59.550 --> 00:05:05.640 uncorrelated, we can say that it's possible that a3 and b1 correlate for some other 00:05:05.640 --> 00:05:10.770 reason. So we relax the constraint. We specify that these two can be correlated, 00:05:10.770 --> 00:05:19.200 and then the variation in a3 and b1 that is shared between these indicators but 00:05:19.200 --> 00:05:25.710 not with others - so it's not part of the factors - then gets to escape these 00:05:25.710 --> 00:05:31.290 error terms, and then we again get a clean estimate of the correlation between factors A and B. 00:05:31.290 --> 00:05:39.090 But this is something that many people do. So your statistical software will tell you that the model 00:05:39.090 --> 00:05:44.280 doesn't fit the data perfectly, and it'll also tell you that you could free some correlations 00:05:44.280 --> 00:05:50.190 to make the model fit better, but that's a bit dangerous unless you know what you're doing. 00:05:50.190 --> 00:05:58.200 You should only add this kind of correlated error if you have a good theoretical reason to do so. So 00:05:58.200 --> 00:06:02.670 the fact that your statistical software tells you that you could do something to increase 00:06:02.670 --> 00:06:11.280 the model fit is not a reason to do it. It's an indication that you could do something and 00:06:11.280 --> 00:06:17.670 that you should consider it. It's not a definite guideline that you should actually do that. 00:06:17.670 --> 00:06:25.260 So under which scenario then is it a good idea to allow the error terms of two 00:06:25.260 --> 00:06:31.080 indicators to correlate? For example, if our indicators looked like this. So we would 00:06:31.080 --> 00:06:37.035 have indicators about innovativeness. So the A factor here is innovativeness and the B factor 00:06:37.035 --> 00:06:42.630 is productivity. 
So we would have questions about innovativeness and questions about productivity. 00:06:42.630 --> 00:06:51.390 Then we realize that, okay, a3 is "our personnel is innovative" and b1 is 00:06:51.390 --> 00:06:56.040 "our personnel is productive". So both of these actually have this 00:06:56.040 --> 00:07:01.020 personnel dimension as well. So they don't measure only innovativeness and 00:07:01.020 --> 00:07:06.600 productivity; they also measure how high quality the personnel in the company are. 00:07:06.600 --> 00:07:11.070 So then we realize that there is a secondary dimension that these two 00:07:11.070 --> 00:07:16.230 indicators measure, and then we can add the error correlation here. But you also have 00:07:16.230 --> 00:07:22.470 to justify it. So it's not enough to say that the statistical software tells us that the model 00:07:22.470 --> 00:07:27.720 fits better if we do something; you have to justify it also in non-statistical terms. 00:07:27.720 --> 00:07:32.730 This is the same thing as with outliers - you don't delete an observation because it 00:07:32.730 --> 00:07:38.100 is different; you have to explain why it's different in non-statistical terms. The same 00:07:38.100 --> 00:07:45.090 thing applies when you eliminate indicators from a scale. Your statistical software will tell you that 00:07:45.090 --> 00:07:51.990 sometimes eliminating an item from a scale will make Cronbach's alpha go up, but that's not 00:07:51.990 --> 00:07:57.660 a reason to eliminate an item. You should also look at non-statistical criteria. So what does 00:07:57.660 --> 00:08:03.990 the item look like? Is there a good reason why we think it's less reliable? 
Because these kinds 00:08:03.990 --> 00:08:10.530 of suggestions by your software could also be just a random correlation between two random 00:08:10.530 --> 00:08:16.470 elements - a random correlation between these E's - and then you would be misspecifying the model. 00:08:16.470 --> 00:08:23.640 Another way - perhaps a slightly better way - to accomplish the same thing is to specify this 00:08:23.640 --> 00:08:30.060 secondary factor. So instead of saying that these two errors are uncorrelated, we could say 00:08:30.060 --> 00:08:36.690 that these indicators a3 and b1 are actually also measuring something else. So we add this 00:08:36.690 --> 00:08:44.160 secondary factor here, and this is a somewhat more appealing approach because you 00:08:44.160 --> 00:08:52.020 then have to explicitly interpret what this factor means. It's a lot easier to free 00:08:52.020 --> 00:08:57.360 correlations without explaining what they actually are, what the interpretation 00:08:57.360 --> 00:09:02.670 of the correlation between these error terms is. It's a lot easier to do that without an explanation 00:09:02.670 --> 00:09:08.970 than to add a factor. If you add a factor, then your reviewers will ask you to explain it, and 00:09:08.970 --> 00:09:13.710 you always should, so it's a good idea to have the factor instead of having the correlation. 00:09:13.710 --> 00:09:21.780 Mathematically both of these accomplish the exact same thing. They allow the unique aspect 00:09:21.780 --> 00:09:26.700 of a3 and b1 that is correlated to escape from the error terms. 00:09:26.700 --> 00:09:35.100 This example of adding minor factors can be 00:09:35.100 --> 00:09:40.980 extended to another scenario as well. So these are just the same indicators again. 00:09:40.980 --> 00:09:51.720 We can have this kind of scenario. So what's the scenario here? We have indicators a1 00:09:51.720 --> 00:09:58.080 through a3 that measure A. 
Indicators b1 to b3 measure B. Then there's unreliability, the 00:09:58.080 --> 00:10:02.460 E's, and then there is some variation that is shared by all the indicators. 00:10:02.460 --> 00:10:11.130 That variation could be, for example, variation due to the measurement method. So this is a scenario 00:10:11.130 --> 00:10:16.200 where you would have common method variance. So the R would be here - the variation due to the 00:10:16.200 --> 00:10:23.100 method, or the common method variance - and then if we estimate the factor model with a1, a2, a3 00:10:23.100 --> 00:10:29.130 loading on A and these b's loading on B, then all variation due to the method escapes 00:10:29.130 --> 00:10:34.680 to the A factor and the B factor, and the factor correlation will be greatly overestimated. 00:10:34.680 --> 00:10:40.290 So in this kind of scenario it is also possible to specify a secondary 00:10:40.290 --> 00:10:47.370 factor. So we can specify this method factor here, and the idea is that all 00:10:47.370 --> 00:10:51.810 the indicators load on the factors they're supposed to measure - the 00:10:51.810 --> 00:10:57.540 factors representing the constructs - and on a factor representing the measurement process. 00:10:57.540 --> 00:11:07.050 So this looks really good. It looks too good to be true. This is not a panacea for method variance 00:11:07.050 --> 00:11:12.730 problems. This kind of model is problematic to estimate. The reason for 00:11:12.730 --> 00:11:22.540 that is that a high correlation between A and B is nearly indistinguishable from a1, 00:11:22.540 --> 00:11:29.230 a2, a3, b1, b2, b3 just being caused by one factor. So they are empirically nearly 00:11:29.230 --> 00:11:34.210 impossible to distinguish, so this kind of model is very unstable to estimate. 
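The point that two highly correlated factors look almost like a single factor can be checked with the standard model-implied covariance of a factor model, Sigma = Lambda Phi Lambda' + Theta. The numbers below are illustrative assumptions only (every loading 0.7, factor correlation 0.95); they are not taken from the video.

```python
import numpy as np

loading = 0.7  # assumed standardized loading for all six indicators

# Two-factor model: a1-a3 load on A (column 0), b1-b3 load on B (column 1).
lam_two = np.zeros((6, 2))
lam_two[:3, 0] = loading
lam_two[3:, 1] = loading
phi_two = np.array([[1.00, 0.95],
                    [0.95, 1.00]])  # assumed high correlation between A and B

# One-factor model: all six indicators load on a single factor.
lam_one = np.full((6, 1), loading)
phi_one = np.array([[1.0]])

# Uncorrelated error terms, scaled so every indicator has unit variance.
theta = np.eye(6) * (1 - loading**2)


def implied_cov(lam, phi, theta):
    """Model-implied covariance matrix: Lambda @ Phi @ Lambda.T + Theta."""
    return lam @ phi @ lam.T + theta


sigma_two = implied_cov(lam_two, phi_two, theta)
sigma_one = implied_cov(lam_one, phi_one, theta)

# Largest difference between what the two models predict for any pair of indicators.
max_gap = np.abs(sigma_two - sigma_one).max()
print(f"largest difference in implied correlations: {max_gap:.4f}")
```

Under these assumptions the two models disagree by only a few hundredths for the pairs of indicators that cross the A and B blocks, and not at all elsewhere, which is easily within sampling error for typical sample sizes.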
00:11:34.210 --> 00:11:41.290 In practice these models have been shown to be problematic even with simulated 00:11:41.290 --> 00:11:48.040 datasets, but there's one way that this kind of model can work, and that is if you add these 00:11:48.040 --> 00:11:52.390 marker indicators. So sometimes you see in published papers that they use marker 00:11:52.390 --> 00:11:58.300 indicators. The idea of marker indicators is that you have indicators that are unrelated 00:11:58.300 --> 00:12:05.410 to the factors that you're modeling, so a1 and b1 are unrelated to these m1 and m2. 00:12:05.410 --> 00:12:12.880 For example, if you measure innovativeness and productivity with questions 00:12:12.880 --> 00:12:17.110 on a one-to-seven scale, you could have a marker indicator of whether 00:12:17.110 --> 00:12:20.650 the person likes jazz music or not. I've actually seen that being used. 00:12:20.650 --> 00:12:27.970 The idea is that how much you like jazz music is completely unrelated to the 00:12:27.970 --> 00:12:34.690 innovativeness and productivity of your company, but if the jazz music indicator correlates with 00:12:34.690 --> 00:12:41.650 these indicators, then we can assume that the correlation is purely due to the measurement 00:12:41.650 --> 00:12:48.850 method, because liking jazz music and innovation really are two completely different things.
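The logic of the marker indicator can be sketched with simulated data. Everything here is a hypothetical setup, not the procedure from any particular paper: the weights, the sample size, and the function name `marker_correlation` are assumptions. The jazz item shares nothing with innovativeness by construction, so any correlation between the two can only come from the shared method component.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000  # large sample so the correlations sit close to their population values

innovativeness = rng.normal(size=n)  # substantive construct
jazz_taste = rng.normal(size=n)      # what the marker item really measures
method = rng.normal(size=n)          # shared response style on the 1-7 scale


def marker_correlation(method_weight):
    """Correlation between an innovativeness item and the jazz marker item."""
    a1 = 0.7 * innovativeness + method_weight * method + 0.5 * rng.normal(size=n)
    m1 = 0.7 * jazz_taste + method_weight * method + 0.5 * rng.normal(size=n)
    return np.corrcoef(a1, m1)[0, 1]


r_clean = marker_correlation(0.0)   # no common method variance
r_method = marker_correlation(0.4)  # assumed amount of common method variance

print(f"no method variance:   {r_clean:.2f}")
print(f"with method variance: {r_method:.2f}")
```

Without the method component the marker correlation is essentially zero; with it, the marker picks up a clearly positive correlation, which is the information a method factor with marker indicators relies on.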