WEBVTT
Kind: captions
Language: en

Confirmatory factor analysis differs from exploratory factor analysis in that in a confirmatory factor analysis the researcher specifies the factors for the data, instead of having the computer discover what the underlying factors are. So confirmatory factor analysis requires that you specify the expected result, and the computer will then tell you whether that result fits your data and will estimate the factor loadings for you. This is a more flexible approach to factor analysis than an exploratory analysis. In this video I will demonstrate, on a conceptual level, what confirmatory factor analysis does and how it can be applied to a wider variety of scenarios.

Our data have six indicators. Indicators a1 through a3 are supposed to measure construct A, so we can see that there is variation in these indicators due to construct A, and there is also some random noise, the E here; that is the unreliability of these indicators. Then there are some variance components that are reliable. For example, if we measure a3 multiple times, there is a part of a3 that is reliable but specific to a3. So, for example, if we ask whether a company is innovative or not, that question measures innovativeness, but it can also measure something else. So unreliability is not the only source of measurement error; there is also some item uniqueness.
The idea of a confirmatory factor analysis is that we specify the factor model for these data ourselves. So, for example, because indicators a1 through a3 are supposed to measure construct A, we assign them to factor A, we assign the others to factor B, and each indicator gets an error term. The factor analysis then decomposes the variance of those indicators into variance that can be attributed to the factors and variance that can be attributed to the error terms, like so.

So now we have a factor solution: all variation that is due to concept A goes to factor A, all variation that is due to concept B goes to factor B, and all the item uniqueness and unreliability goes to the error terms, which are assumed to be uncorrelated. So these are uncorrelated, distinct sources of variation for each indicator, and then we have the two common factors. That is the ideal case.

Sometimes your data are not as great as you would like, and confirmatory factor analysis also allows you to model problems in your measurement. Consider, for example, this kind of scenario. There is again variation due to construct A, variation due to construct B, unreliability (the black circles here), and unique aspects of each indicator, but there is also some variation in a3 and b1 that correlates. That variance component is here. These letters have no particular meaning, by the way.
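The ideal case can be sketched with a small simulation. This is only an illustration of the measurement model, not the CFA estimator itself, and the numbers in it (loadings of 0.8, a factor correlation of 0.3, the sample size) are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two latent constructs, A and B, with a true correlation of 0.3
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
A, B = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

def indicators(factor):
    # Each indicator = loading * factor + unique error term
    # (the error lumps together unreliability and item uniqueness)
    loading = 0.8
    return np.column_stack([
        loading * factor + np.sqrt(1 - loading**2) * rng.standard_normal(n)
        for _ in range(3)
    ])

a = indicators(A)   # a1..a3, driven by construct A
b = indicators(B)   # b1..b3, driven by construct B

# Implied correlations: 0.8 * 0.8 = 0.64 within a block,
# 0.8 * 0.8 * 0.3 = 0.192 across blocks
within = np.corrcoef(a[:, 0], a[:, 1])[0, 1]
across = np.corrcoef(a[:, 0], b[:, 0])[0, 1]
print(f"within-block r = {within:.2f}, cross-block r = {across:.2f}")
```

A confirmatory factor analysis run on data like this would recover the loadings and the 0.3 factor correlation, because the model being fitted matches the process that generated the data.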
They are just letters to distinguish the different circles. So a3 and b1 correlate for some reason other than measuring A and measuring B, which could themselves be correlated. If we fit a confirmatory factor analysis model, then this variation that is shared by a3 and b1 actually goes to the factors. The reason it goes to the factors is that the error terms are constrained to be uncorrelated. What happens now is that our factors, which are supposed to represent construct A and construct B, are contaminated by this secondary source of variation that is present in a3 and b1, and as a consequence the correlation between factors A and B will be overestimated and your results will be biased.

This is also the case in exploratory factor analysis. If we have this kind of minor factor that influences a3 and b1 and we extract only two factors, then the correlation between those two factors will be inflated. If we were to run an exploratory analysis, it could identify that there is a third factor that loads on a3 and b1, but because there are just two indicators, it is also possible that the exploratory analysis would not identify that factor for us.

So what can we do in this kind of situation? Confirmatory analysis also allows us to model correlated errors.
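The inflation can be made concrete with a quick simulation. As a hedge: rather than fitting an actual CFA, the sketch below correlates unit-weighted scale scores, which exhibits the same mechanism, and all numbers (loadings, the 0.5 weight on a hypothetical shared nuisance source S) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# True factor correlation is 0.3
A, B = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n).T
S = rng.standard_normal(n)   # nuisance source shared by a3 and b1 only

loading, noise = 0.8, 0.6    # hypothetical loading and error scale
a = np.column_stack([loading * A + noise * rng.standard_normal(n) for _ in range(3)])
b = np.column_stack([loading * B + noise * rng.standard_normal(n) for _ in range(3)])

a_dirty, b_dirty = a.copy(), b.copy()
a_dirty[:, 2] += 0.5 * S     # a3 picks up the secondary source
b_dirty[:, 0] += 0.5 * S     # b1 picks up the secondary source

def composite_corr(x, y):
    # Correlation of unit-weighted scale scores: a rough stand-in for the
    # factor correlation that a model without error correlations reports
    return np.corrcoef(x.mean(axis=1), y.mean(axis=1))[0, 1]

r_clean = composite_corr(a, b)
r_dirty = composite_corr(a_dirty, b_dirty)
print(f"clean: {r_clean:.3f}  contaminated: {r_dirty:.3f}")
# The contaminated estimate is visibly larger: the shared variance of a3
# and b1 has nowhere to go except into the A-B relationship.
```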
So instead of specifying that the error terms of a3 and b1 are uncorrelated, we can say that it is possible that a3 and b1 correlate for some other reason. We relax the constraint: we specify that these two error terms can be correlated, and then the variation in a3 and b1 that is shared between these indicators but not with the others (so it is not part of the factors) is captured by the error terms, and we again get a clean estimate of the correlation between factors A and B.

But this is something that many people do. Your statistical software will tell you that the model does not fit the data perfectly, and it will also tell you that you could free some error correlations to make the model fit better, but that is a bit dangerous unless you know what you are doing. You should only add this kind of correlated error if you have a good theoretical reason to do so. The fact that your statistical software tells you that you could do something to increase the model fit is not a reason in itself to do it. It is an indication that you could do something and should consider it; it is not a definitive guideline that you should actually do it.

So under which scenario is it a good idea to allow the error terms of two indicators to correlate? For example, if our indicators looked like this: we have indicators about innovativeness, so factor A here is innovativeness and factor B is productivity.
So we would have questions about innovativeness and questions about productivity. Then we realize that a3 is "our personnel are innovative" and b1 is "our personnel are productive". Both of these actually have this personnel dimension as well. So they do not measure only innovativeness and productivity; they also measure how high-quality the personnel in the company are. Then we realize that there is a secondary dimension that these two indicators measure, and we can add the error correlation here. But you also have to justify it. It is not enough to say that the statistical software tells us the model fits better if we do something; you have to justify it in non-statistical terms as well.

This is the same as with outliers: you do not delete an observation because it is different; you have to explain why it is different in non-statistical terms. The same applies when you eliminate indicators from a scale. Your statistical software will tell you that sometimes eliminating an item from a scale makes Cronbach's alpha go up, but that is not a reason to eliminate the item. You should also look at non-statistical criteria: what does the item look like, and is there a good reason to think it is less reliable? These kinds of suggestions from your software could also be just a chance correlation between two error terms (a random correlation between these E's), and then you would be misspecifying the model.
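In lavaan-style model syntax (used by R's lavaan and, in near-identical form, by Python's semopy), the relaxed specification for this example would look roughly like the sketch below; the indicator names a1..b3 are the hypothetical ones from the slides.

```text
# Measurement model: innovativeness (A) and productivity (B)
A =~ a1 + a2 + a3
B =~ b1 + b2 + b3

# Theoretically justified error correlation: both a3 ("our personnel
# are innovative") and b1 ("our personnel are productive") also tap
# a personnel-quality dimension
a3 ~~ b1
```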
Another way, perhaps a slightly better way, to accomplish the same thing is to specify a secondary factor. Instead of freeing the correlation between these two errors, we say that indicators a3 and b1 are actually also measuring something else. So we add this secondary factor here. This is a somewhat more appealing approach because it forces you to explicitly interpret what the factor means. It is a lot easier to free error correlations without explaining what the correlation between those error terms actually represents than it is to add a factor without explanation. If you add a factor, your reviewers will ask you to explain it, and you always should, so it is a good idea to use the factor instead of the correlation. Mathematically, both of these accomplish exactly the same thing: they allow the unique, correlated aspect of a3 and b1 to escape from the error terms.

This example of adding minor factors can be extended to another scenario. Here are the same indicators again, but now consider this kind of scenario. Indicators a1 through a3 measure A, indicators b1 through b3 measure B, there is unreliability (the E's), and then there is some variation that is shared by all the indicators. That variation could be, for example, variation due to the measurement method.
So this is a scenario where you would have common method variance. The R here is the variation due to the method, the common method variance. If we estimate the factor model with a1, a2, and a3 loading on A and the b's loading on B, then all variation due to the method escapes into the A factor and the B factor, and the factor correlation will be greatly overestimated.

In this kind of scenario it is possible to specify a secondary factor as well. We can specify a method factor here, and the idea is that all the indicators load both on the factors they are supposed to measure (the factors representing the constructs) and on a factor representing the measurement process.

This looks really good; in fact, too good to be true. It is not a panacea for method variance problems. This kind of model is problematic to estimate. The reason is that a high correlation between A and B is nearly indistinguishable from a1, a2, a3, b1, b2, and b3 all being caused by a single factor. They are empirically almost impossible to tell apart, so this kind of model is very unstable to estimate. In practice these models have been shown to be problematic even with simulated datasets, but there is one way this kind of model can work: adding marker indicators. Sometimes you see marker indicators used in published papers.
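To see how badly an unmodeled method factor can bias things, here is a small simulation in the same hedged style: composite scores rather than a fitted model, with made-up loadings (0.8 substantive, 0.4 method).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# True factor correlation is 0.3; M is a method factor touching every item
A, B = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n).T
M = rng.standard_normal(n)

lam, meth, noise = 0.8, 0.4, 0.6   # hypothetical loadings
a = np.column_stack([lam * A + meth * M + noise * rng.standard_normal(n)
                     for _ in range(3)])
b = np.column_stack([lam * B + meth * M + noise * rng.standard_normal(n)
                     for _ in range(3)])

r = np.corrcoef(a.mean(axis=1), b.mean(axis=1))[0, 1]
# Analytically: (0.8**2 * 0.3 + 0.4**2) / (0.8**2 + 0.4**2 + 0.6**2 / 3)
# = 0.352 / 0.92 ≈ 0.38, well above the true 0.30
print(f"estimated correlation: {r:.2f}")
```

Because the method factor raises every inter-item correlation, a model that ignores it has no choice but to push that shared variance into the factors and their correlation.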
The idea of marker indicators is that you include indicators that are unrelated to the factors you are modeling, so the a and b indicators are unrelated to these markers m1 and m2. For example, if you study innovativeness and productivity with questions on a one-to-seven scale, you could have a marker indicator asking whether the person likes jazz music or not. I have actually seen that being used. The idea is that how much you like jazz music is completely unrelated to the innovativeness and productivity of your company, so if the jazz music indicator correlates with the other indicators, we can assume that the correlation is purely due to the measurement method, because liking jazz music and innovativeness really are two completely different things.
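The marker logic can be sketched the same way. This assumes a hypothetical method factor M with made-up loadings; the marker loads only on M (here with a 0.5 loading), never on A or B.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

A, B = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n).T
M = rng.standard_normal(n)   # method component shared by every response

lam, meth, noise = 0.8, 0.4, 0.6
items = np.column_stack(
    [lam * A + meth * M + noise * rng.standard_normal(n) for _ in range(3)]
    + [lam * B + meth * M + noise * rng.standard_normal(n) for _ in range(3)]
)

# Marker: substantively unrelated to A and B (e.g. "I like jazz music"),
# but answered with the same method, so it picks up M
marker = 0.5 * M + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)

r_marker = [np.corrcoef(marker, items[:, j])[0, 1] for j in range(6)]
# Each correlation should be near 0.5 * 0.4 / sqrt(1.16) ≈ 0.19: the marker
# correlates with every item even though it shares no substance with them,
# and that shared piece is exactly what a method factor can absorb.
print([f"{r:.2f}" for r in r_marker])
```

The roughly uniform, nonzero marker correlations are the signal: a substantively unrelated item should correlate at zero with everything, so whatever correlation remains can be attributed to the shared measurement method.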