There are two things that we need to consider before we can even start estimating a confirmatory factor analysis model, called scale setting and identification.

Scale setting means that every variable must have a metric: we have to be able to estimate the variance, and sometimes the mean, of every variable. Identification means that the data provide enough information to estimate the model that we want to estimate. The confirmatory factor analysis framework is very flexible, and it is possible to define models that are mathematically impossible to estimate uniquely. So in this video we will go through the requirements you have to consider before you can even estimate a model meaningfully.

Let's take a look at this model with just two indicators. We have indicators a1 and a2, and we want to estimate factor A. We have two error variances and two factor loadings, so we have four things that we want to estimate: four free parameters.

Then we start estimating. We calculate the model-implied correlations. From the data we have two variances - the variance of a1 and the variance of a2 - and one correlation, so we have three unique elements of information that we model using these four parameters.

The problem is that we now have three units of information and four things that we want to estimate. The degrees of freedom is minus one, and such a model cannot be estimated meaningfully. The intuition is that you cannot estimate four things from three things: you have to have at least as much information as the number of things you want to estimate. So this model is not identified, and there are ways to simplify the model so that we can estimate something, or we can add more indicators to make it identified.

So this model is not identified because the degrees of freedom is negative, and factor analysis without additional constraints always requires at least three indicators. A factor analysis of only two indicators is not a very meaningful analysis anyway: you can make it identified, for example by constraining the two factor loadings to be equal, but the estimation wouldn't give you any meaningful information.
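To make the counting concrete, here is a minimal Python sketch (my own illustration, not from the video) of the degrees-of-freedom arithmetic for a one-factor model with p indicators, where the factor itself is given a fixed scale:

```python
def unique_moments(p: int) -> int:
    """Unique elements of a p-by-p correlation matrix:
    p variances plus p*(p-1)/2 correlations, i.e. p*(p+1)/2 in total."""
    return p * (p + 1) // 2

def free_parameters(p: int) -> int:
    """One-factor model with the factor variance fixed:
    p loadings plus p error variances."""
    return 2 * p

for p in (2, 3, 4):
    df = unique_moments(p) - free_parameters(p)
    print(f"{p} indicators: {unique_moments(p)} moments, "
          f"{free_parameters(p)} parameters, df = {df}")

# 2 indicators: 3 moments, 4 parameters, df = -1  (not identified)
# 3 indicators: 6 moments, 6 parameters, df = 0   (just identified)
# 4 indicators: 10 moments, 8 parameters, df = 2  (overidentified)
```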
Let's work more with this example. Assume that the correlation matrix for this model - two factors, each with one indicator - shows that a1 and b1 are correlated at 0.1. That one correlation depends on three parameters: the two factor loadings and the factor correlation. (The indicator variances depend on the model too, but we don't really care about them in this video.)

So why is the correlation between a1 and b1 so low? There are basically three different options. It's possible that a1 and b1 are both highly reliable indicators of factors A and B, and A and B are just weakly correlated. It's possible that A and B are highly correlated but a1 is unreliable, and therefore we observe only a small correlation. Or it's possible that A and B are highly correlated but b1 is unreliable.

The problem is that we cannot know which of these three options is correct, because they all have the same empirical implication: the observed correlation is quite small. This is another example of the identification problem. Here we are estimating five things - two error variances, two factor loadings, and one factor correlation - from just three elements of information. We can't do that. The model is not identified, and we cannot know empirically which of the three explanations is correct.

Of course, we can then use theory to rule out some of these alternative explanations, but that goes beyond the factor analysis estimates and identification. So this model is not identified; it cannot be estimated meaningfully.
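The indistinguishability is easy to verify numerically. Under the model, the implied correlation between the two indicators is the product of the two loadings and the factor correlation. A small sketch with illustrative parameter values (my own, not the video's) shows that each explanation reproduces the same observed correlation of 0.1:

```python
# Implied correlation between a1 and b1 with standardized factors and
# indicators: corr(a1, b1) = loading_a1 * corr(A, B) * loading_b1.
explanations = {
    "both indicators reliable, A and B weakly correlated": (1.0, 0.1, 1.0),
    "A and B highly correlated, a1 unreliable":            (0.125, 0.8, 1.0),
    "A and B highly correlated, b1 unreliable":            (1.0, 0.8, 0.125),
}

for label, (lam_a1, psi_ab, lam_b1) in explanations.items():
    print(f"{label}: implied correlation = {lam_a1 * psi_ab * lam_b1:.2f}")

# All three print 0.10: the data cannot tell the explanations apart.
```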
Let's take a look at scale setting now. Identification, as we saw, basically means that you have at least as much information as what you estimate: the number of unique elements in the correlation matrix of the indicators must equal or exceed the number of free parameters in the model.

Normally, in exploratory factor analysis, we have standardized factors: every factor has a variance of one and a mean of zero, and that defines the scale of these variables. Every variable must have a variance, and in exploratory analysis the factors are scaled to have unit variance - they are standardized - so all the factor loadings are standardized regression coefficients.

What if we don't standardize the factors? Instead of saying that each factor's variance is one, we estimate the factor variances. We add these factor variances to the model, which gives us 15 free parameters. We still have 21 units of information from which to estimate, and we estimate 15 different things, so the degrees of freedom is 6, which means that this model is overidentified. The degrees of freedom is positive, so in principle it is possible to estimate this model meaningfully.

We can do the estimation. Let's say that is our observed correlation matrix and that is our implied correlation matrix. We can then find values for the ψs and the λs such that the implied matrix reproduces the observed correlation matrix perfectly. In this case that is possible because the correlations all have the same values. Generally, in small samples, you will never reproduce the data completely, but in this example you do, just to simplify things.

So we estimate, and that is one set of estimates that gives an exact fit between the observed correlation matrix and the implied correlation matrix. So we're fine, right? It turns out we have a small problem, because there is another set of estimates that also reproduces the correlation matrix perfectly through the implied correlation matrix. You can plug these values into the equations and verify that they produce the exact same implied correlations. In one solution factor A's variance is 1, in the other it is 2, and yet they produce the same fit. So what do we do? We can come up with indefinitely many such examples: if factor A's variance is 0.5, we get yet another set of factor loading values, and still the empirical correlation matrix is reproduced perfectly by the model-implied correlation matrix.
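A quick numerical check of this equivalence (with illustrative loadings, not the video's values): scaling the factor variance up and the loadings down by the matching amount leaves the model-implied matrix untouched.

```python
import numpy as np

lam = np.array([0.8, 0.7, 0.6])   # loadings when the factor variance is 1
theta = np.diag(1 - lam ** 2)     # error variances for unit indicator variances

def implied_cov(lam, psi, theta):
    """Model-implied covariance matrix: Sigma = lam * psi * lam' + Theta."""
    return np.outer(lam, lam) * psi + theta

sigma_a = implied_cov(lam, 1.0, theta)               # factor variance 1
sigma_b = implied_cov(lam / np.sqrt(2), 2.0, theta)  # factor variance 2

print(np.allclose(sigma_a, sigma_b))  # True: identical fit, different parameters
```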
So this is the problem of scale setting of latent variables in confirmatory factor analysis models. We need to set the metric of the factors. Because we don't observe the factors, their scales are arbitrary: we don't know whether they vary from 0 to 1, from 0 to 1 million, or from minus 5 to plus 10. We don't know their ranges, we don't know their variances, and we don't know their means. We have to specify the scale of each factor ourselves.

In exploratory analysis we typically don't model means, and we fix the variances of the factors to ones. In confirmatory analysis there are reasons, which I'll explain a bit later, why we don't fix the variances to ones. The general problem is that we must define whether we are talking about centimetres or inches, about Celsius or Fahrenheit. They quantify exactly the same thing, and from a statistical perspective they are equally good measures of length or temperature; we simply have to agree on which scale we are using.

Similarly, regression gives us the effect of a one-unit change in the independent variable on the dependent variable, so interpreting regression coefficients only makes sense after we have decided how the unit is defined. What is the unit of A, and what is the unit of B? We have to set them manually, so we have to decide on a scale-setting approach.

In exploratory analysis, as I said, we typically say that factor A, factor B, and all other factors have variances of one. That produces standardized factor loadings, which are standardized regression coefficients of the indicators on the factors or, in the case of uncorrelated factors, correlations. We use that in exploratory factor analysis, but we cannot use it in a structural regression model. A structural regression model is an extension of the factor analysis model in which we allow regression relationships between the factors. The reason we can't use this approach there is that the variance of an endogenous variable - a variable that depends on other variables - is a function of the variables it depends on. We can't fix a variable's variance to one if that variance is determined by other things in the model. But that is beyond this video.

Another very common approach is to fix the first indicator's loading to one. This is the default scale-setting approach in most structural regression modelling and confirmatory factor analysis software. The reason is that it can be used pretty much always, regardless of what kinds of variables A and B are and what kind of relationship is specified between them. The idea is that, if we assume classical test theory holds - so all the errors are just random noise - then the variance of A is the variance of the true score of a1. That is appealing: if the only source of error is random noise, the variance of factor A is the variance that a1 would have if it were not contaminated with random noise. It allows us to think in the scale of the indicators, free of error variance, assuming classical test theory holds for the data.

This is such a common approach that there is a rule of thumb that I present: always use the first indicator to fix the scale.
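Here is a sketch of what that convention does, again with made-up numbers: starting from a solution with a standardized factor, dividing the loadings by the first loading and absorbing it into the factor variance gives an equivalent model whose first loading is exactly one.

```python
import numpy as np

lam_std = np.array([0.8, 0.7, 0.6])   # loadings with factor variance fixed to 1
theta = np.diag(1 - lam_std ** 2)     # error variances

# First-indicator parameterization: first loading fixed to 1, factor variance free.
lam_fixed = lam_std / lam_std[0]      # [1.0, 0.875, 0.75]
psi_fixed = lam_std[0] ** 2           # 0.64: the true-score variance of a1
                                      # under classical test theory

sigma_std = np.outer(lam_std, lam_std) * 1.0 + theta
sigma_fixed = np.outer(lam_fixed, lam_fixed) * psi_fixed + theta

print(np.allclose(sigma_std, sigma_fixed))  # True: same implied correlations
```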
We can see that the papers we have used as examples in these videos use this approach. In Mesquita and Lazzarini, you can see that all the loadings of the first indicators are ones. They set the scale of each latent variable by fixing that loading to one, and next to the estimates they report Z-statistics; you can see that the first indicators don't have a Z-statistic. The reason is that those loadings are not estimated from the data - instead, the researcher declares them to be ones. If something is not estimated, it doesn't vary from sample to sample, so it doesn't have a standard error, and we can't calculate a Z-statistic for it.

We can see the same in Yli-Renko's paper. There the first loading is not one, but it has no standard error and no Z-statistic, which is an indication that they fixed the first loading to one to set the scale of the latent variables.

If you want standardized factor loadings - loadings expressed on the scale of an exploratory analysis, where the factor variances are ones - you can rescale the confirmatory factor analysis results afterwards. Your software will produce them for you if you check the standardized estimates option. These are standardized estimates, but the scaling has been done after estimation: you first estimate an unstandardized confirmatory factor analysis in which each factor is scaled by fixing its first indicator's loading, and then you rescale the resulting solution. That is the same approach used for standardized regression coefficients: you first estimate the regression, then you rescale the parameter estimates.
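The rescaling itself is simple arithmetic. A sketch with hypothetical estimates (not taken from either paper) of how an unstandardized solution maps to standardized loadings, which is what the standardized-estimates option computes after estimation:

```python
import numpy as np

lam = np.array([1.0, 0.875, 0.75])    # loadings, first one fixed to 1
psi = 0.64                            # estimated factor variance
theta = np.array([0.36, 0.51, 0.64])  # estimated error variances

# Standardized loading = unstandardized loading * sd(factor) / sd(indicator).
sd_indicators = np.sqrt(lam ** 2 * psi + theta)  # model-implied indicator SDs
lam_standardized = lam * np.sqrt(psi) / sd_indicators

print(lam_standardized)  # [0.8 0.7 0.6]
```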
So, a summary of identification of confirmatory factor analysis models. A model is identified if every latent variable has a scale, if the degrees of freedom is non-negative, and if every part of the model is identified.

In confirmatory factor analysis, once every latent variable - every factor - has a scale, any factor with three indicators is always identified: if you have three indicators, you can always run a factor analysis, no matter what. If you have two indicators, we can either fix both factor loadings to one, which says the two indicators are equally reliable, or we can embed the factor in a larger system. With just two indicators alone, we can't estimate a factor model unless we fix the factor loadings to be the same. If we embed the two-indicator factor into a larger factor analysis, then we can estimate it, because we can use information from the other indicators to estimate those factor loadings.

Then there is the single-indicator rule. If we have a factor with just a single indicator, we cannot estimate the reliability of the indicator, because you cannot estimate reliability based on just one measure. We have to assume the error variance, and typically we do that by constraining it to be zero: we say that factor A, or construct A, is measured without any error, because we can't estimate the error. Of course, we could constrain the error variance to something else. If the indicator has typically been shown to be eighty percent reliable, we can fix the error variance to 20 percent of the indicator's observed variance, so that the factor accounts for the remaining 80 percent, but that is rarely done.

So identification is a requirement for estimation: if our model is not identified, it cannot be meaningfully estimated. Identification basically asks whether you have enough information to estimate the model. If we have one correlation, we can't estimate two different things from it. You need at least one unit of information for everything that you estimate, and ideally you have more, so that there is redundancy. We need a scale for all latent variables, and the degrees of freedom must be non-negative. Ideally it is positive, and the more positive it is, the better our model tests are.
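As a closing illustration of the single-indicator rule above, a sketch (with made-up numbers) of fixing the error variance from an assumed reliability instead of assuming error-free measurement:

```python
observed_variance = 2.5  # sample variance of the single indicator
reliability = 0.80       # assumed from prior research, not estimated

# Classical test theory: observed variance = true-score variance + error
# variance, and reliability is the true-score share. With the loading fixed
# to 1, the factor variance is the reliable part and the fixed error
# variance is the remainder.
factor_variance = reliability * observed_variance             # 2.0
fixed_error_variance = (1 - reliability) * observed_variance  # 0.5

print(factor_variance, fixed_error_variance)
```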