WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:04.200 Factor analysis is a very useful tool for validating measurement. 00:00:04.200 --> 00:00:09.390 The idea of factor analysis is that it takes in multiple indicators and 00:00:09.390 --> 00:00:12.960 then it answers the question: what do the indicators have in common? 00:00:12.960 --> 00:00:18.840 So it tries to extract or identify underlying dimensions from your data. 00:00:18.840 --> 00:00:24.810 The reason why we use factor analysis for measurement is that before we apply any 00:00:24.810 --> 00:00:28.740 reliability statistics we have to study if the indicators are 00:00:28.740 --> 00:00:33.690 unidimensional. If so, then we use a unidimensional reliability index. If 00:00:33.690 --> 00:00:38.820 not, then we calculate the reliability statistic based on the factor analysis. 00:00:38.820 --> 00:00:45.000 Factor analysis can also be used to assess the hypothesis that the indicators are 00:00:45.000 --> 00:00:51.180 consequences of a common cause, and in that way we can try to use factor 00:00:51.180 --> 00:00:57.720 analysis to justify causal claims where we say that the construct causes multiple items. 00:00:57.720 --> 00:01:02.430 There are two main variants of factor analysis techniques: 00:01:02.430 --> 00:01:06.060 exploratory factor analysis and confirmatory factor analysis. 00:01:06.060 --> 00:01:13.890 Exploratory factor analysis is an exploratory process where you give the computer 00:01:13.890 --> 00:01:20.820 your dataset and then you ask the computer to give you three factors, two factors, or however many 00:01:20.820 --> 00:01:26.070 factors you want to have from the data, and then the computer will identify the factors. 00:01:26.070 --> 00:01:33.000 In confirmatory factor analysis you specify the factor structure yourself.
So you say 00:01:33.000 --> 00:01:37.500 the first three indicators for example measure one thing, that is one factor, 00:01:37.500 --> 00:01:42.570 then the second three measure another thing, that's a factor, and then the remaining four 00:01:42.570 --> 00:01:47.190 indicators measure a third thing and that's the third factor, and then the computer 00:01:47.190 --> 00:01:53.490 will estimate the model for you and tell if that model is plausible for the data. 00:01:53.490 --> 00:01:59.730 Exploratory factor analysis is easy to apply because you don't have to specify the structure 00:01:59.730 --> 00:02:04.530 yourself. You just specify the number of factors and which variables you use, and 00:02:04.530 --> 00:02:11.310 for that reason many people get started with exploratory factor analysis instead, and if you do 00:02:11.310 --> 00:02:17.170 data exploration or some initial analysis then exploratory factor analysis is quicker to do. 00:02:17.170 --> 00:02:21.100 Then exploratory analysis is the one that is 00:02:21.100 --> 00:02:24.550 typically covered first, followed by confirmatory factor analysis. 00:02:24.550 --> 00:02:30.760 I will demonstrate factor analysis using the exploratory approach and to do that we 00:02:30.760 --> 00:02:37.330 need some data, and our data are from the Olympic decathlon. So we have the ten sports that the 00:02:37.330 --> 00:02:46.270 athletes do, which are the 100 meters run, long jump, shot put, high jump, 400 meters run, 00:02:46.270 --> 00:02:54.130 110 meter hurdles, discus throw, pole vault, javelin throw and 1500 meters run. 00:02:54.130 --> 00:03:01.420 So there are 10 different sports that you do in this competition and then you are rated 00:03:01.420 --> 00:03:07.060 based on your performance in all of them. And the overall ranking is determined by these scores. 00:03:07.060 --> 00:03:11.380 So you have to be a very good overall athlete to be able to do the decathlon.
00:03:11.380 --> 00:03:20.410 So the data looks like this. So that's the first 15 observations. The 100 meters is in seconds, 00:03:20.410 --> 00:03:28.780 long jump how many meters, shot put how many meters, high jump how many meters, 00:03:28.780 --> 00:03:33.940 400 meters run how many seconds, 110 meter hurdles how many seconds, 00:03:33.940 --> 00:03:41.440 discus throw how far in meters you threw it, pole vault how high in meters, javelin how many 00:03:41.440 --> 00:03:48.100 meters you threw the javelin, and then how many seconds the one and a half kilometer run took. 00:03:48.100 --> 00:03:55.360 So what kind of dimensions does this data have? That's what factor analysis will tell us. And 00:03:55.360 --> 00:04:01.990 we'll first do a factor analysis and we'll request two factors just to get started with something. 00:04:01.990 --> 00:04:08.920 So that's the two-factor solution, and before I explain the factors it's important to understand 00:04:08.920 --> 00:04:16.480 what these numbers tell us. And let's start with uniqueness and communality. 00:04:16.480 --> 00:04:25.750 So uniqueness and communality sum to 100 percent, or 1. And communality 00:04:25.750 --> 00:04:31.990 tells how much of the variation of this particular indicator the two factors explain. 00:04:31.990 --> 00:04:39.850 So for example for shot put the factors explain ninety-four and a 00:04:39.850 --> 00:04:45.430 half percent of the variation and only 0.5% is unexplained. So the 00:04:45.430 --> 00:04:50.590 uniqueness is how much of the indicator remains unexplained by the factors. 00:04:50.590 --> 00:04:57.340 Ideally, if the factor model is correctly specified, so that the factors perfectly 00:04:57.340 --> 00:05:04.420 match your theoretical constructs and there are no systematic measurement 00:05:04.420 --> 00:05:10.300 errors in the indicators, then this uniqueness here quantifies the amount of random noise in the indicators.
00:05:10.300 --> 00:05:16.750 That's an ideal case. Whether that applies in any real case, that's another question. 00:05:17.620 --> 00:05:23.290 So the communality is a kind of measure of reliability and the uniqueness is 00:05:23.290 --> 00:05:26.650 an estimate of unreliability. So that's one way. 00:05:26.650 --> 00:05:34.780 Then we have two factors. We have MR1 and MR2. The MR simply comes from the fact 00:05:34.780 --> 00:05:38.530 that we estimated with the minres technique; you don't have to care about what that means. 00:05:38.530 --> 00:05:47.620 So we have a first factor and a second factor. These are called factor loadings. And they 00:05:47.620 --> 00:05:53.590 are in a correlation metric here. So the idea here is that the first indicator 00:05:53.590 --> 00:05:59.800 correlates at minus 0.71 with the first factor and minus 0.22 with the second 00:05:59.800 --> 00:06:04.870 factor. So the first variable is very strongly associated 00:06:04.870 --> 00:06:10.330 with the first factor and then a bit more weakly associated with the second factor. 00:06:10.330 --> 00:06:13.630 So let's just take a look at the first factor now. 00:06:13.630 --> 00:06:19.060 In the first factor here we first notice that some of the indicators 00:06:19.060 --> 00:06:23.140 have negative factor loadings. We have to understand why that is the case. 00:06:23.140 --> 00:06:31.600 If we start to look at those items that have negative loadings, we have the 100 meter run, 00:06:31.600 --> 00:06:38.530 we have the 400 meter run, we have the 110 meter hurdles and then we have the 00:06:38.530 --> 00:06:45.100 1500 meter run. So all these are running sports and what they have in common is that 00:06:45.100 --> 00:06:50.470 more time means that you're worse. Less time means you're better.
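As a minimal sketch of how the loadings relate to communality and uniqueness: assuming the two unrotated factors are uncorrelated, an indicator's communality is the sum of its squared loadings, and its uniqueness is one minus that. The loadings below are the ones quoted for the first indicator, taken here as -0.71 and -0.22. The function names are illustrative and do not come from any factor analysis package.

```python
# Loadings quoted in the talk for the first indicator (factor 1, factor 2).
first_indicator_loadings = [-0.71, -0.22]


def communality(loadings):
    """Share of an indicator's variance explained by uncorrelated factors."""
    return sum(value ** 2 for value in loadings)


def uniqueness(loadings):
    """Share of variance left unexplained; communality + uniqueness = 1."""
    return 1.0 - communality(loadings)


h2 = communality(first_indicator_loadings)
u2 = uniqueness(first_indicator_loadings)
print(f"communality = {h2:.4f}, uniqueness = {u2:.4f}")
```

For the first indicator this gives a communality of roughly 0.55, so a bit under half of its variation is left unexplained by the two factors.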
00:06:50.470 --> 00:06:54.730 With all these others you are throwing something 00:06:54.730 --> 00:07:00.520 or you're jumping, and more is better. So in these running sports 00:07:00.520 --> 00:07:05.770 less time is better; in these others more distance or more height is better. 00:07:05.770 --> 00:07:11.410 To make the results a bit more understandable I 00:07:11.410 --> 00:07:16.870 will therefore now reverse score the times, so that for all variables 00:07:16.870 --> 00:07:21.220 more of a variable indicates that the athlete performs better. 00:07:21.220 --> 00:07:26.050 So I will reverse the signs of all these running sports and 00:07:26.050 --> 00:07:28.150 then we have this kind of factor analysis result. 00:07:28.150 --> 00:07:32.830 We can see that every indicator here loads 00:07:32.830 --> 00:07:38.320 positively on the first factor and the magnitudes of the factor loadings differ. 00:07:38.320 --> 00:07:46.120 So how would we interpret the first factor? All indicators are positively associated with 00:07:46.120 --> 00:07:52.750 something. What's the thing? We have to interpret what the underlying dimension is that 00:07:53.590 --> 00:07:58.990 influences these indicators, or variables, according to these results. 00:07:58.990 --> 00:08:07.180 This first factor, if everything correlates positively with the first factor, then the first 00:08:07.180 --> 00:08:11.980 factor basically is how good the guy is. So how good an athlete the person 00:08:11.980 --> 00:08:19.210 is. If you are a good athlete then you perform better in all of these sports. 00:08:19.210 --> 00:08:23.410 So good athletes are expected to perform better than bad athletes. 00:08:23.410 --> 00:08:26.200 Therefore all the items are positively correlated.
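The reverse scoring step can be sketched as follows. The times and jump lengths are made-up numbers for illustration, not the decathlon data from the video, and `pearson` is a hand-written helper. Negating the times flips the sign of their correlation with every other variable without changing its size, which is why the negative loadings turn positive.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical toy data for five athletes.
time_100m = [10.8, 11.0, 11.2, 11.5, 11.9]  # seconds, lower is better
long_jump = [7.6, 7.4, 7.3, 7.0, 6.8]       # meters, higher is better

# Reverse score the times by flipping their sign, so higher means better.
reversed_100m = [-t for t in time_100m]

r_raw = pearson(time_100m, long_jump)
r_reversed = pearson(reversed_100m, long_jump)
print(f"raw times:      r = {r_raw:.3f}")
print(f"reversed times: r = {r_reversed:.3f}")
```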
00:08:26.200 --> 00:08:38.170 The second factor here: we can see that there are shot put and javelin, and 00:08:38.170 --> 00:08:44.770 these two are positively associated. The 1500 meters is negatively associated, as are all the 00:08:44.770 --> 00:08:52.060 other running sports. So the second factor quantifies whether the person is better 00:08:52.060 --> 00:08:58.380 at sports that require strength versus the sports that require running speed. 00:08:58.380 --> 00:09:04.110 So there is a trade-off: if you are a very bulky guy, you're good in these strength 00:09:04.110 --> 00:09:09.300 sports, but you have more mass and therefore you're not that great in the running 00:09:09.300 --> 00:09:14.010 sports. So there's a trade-off and this second factor quantifies that trade-off. 00:09:14.010 --> 00:09:18.840 So we have a factor of how good a guy is and we have a factor of whether 00:09:18.840 --> 00:09:21.390 the guy is better at running or strength sports. 00:09:21.390 --> 00:09:28.020 That's not... We would ideally like to think that there are two dimensions to 00:09:28.020 --> 00:09:34.710 this data: how good the guy is in running and how good the guy is in these sports that 00:09:34.710 --> 00:09:39.360 require strength. But this factor analysis solution doesn't answer that question. 00:09:39.360 --> 00:09:44.340 To answer that question we do something called factor rotation. So factor 00:09:44.340 --> 00:09:51.360 rotation is a technique that reorients the factor solution so that it's simpler to interpret. 00:09:51.360 --> 00:09:57.990 Typically when you apply a factor analysis and you have two correlated dimensions then 00:09:57.990 --> 00:10:02.430 the first factor will capture a little bit of both dimensions.
Like we have 00:10:02.430 --> 00:10:07.320 running speed and strength captured by the factor of how good the guy is, and 00:10:07.320 --> 00:10:11.940 the second factor will then capture whether the guy is better at running 00:10:11.940 --> 00:10:18.330 or at strength sports. When we reorient the factor analysis using factor rotation then 00:10:18.330 --> 00:10:26.640 the factors will typically correspond better to actual dimensions in the data. 00:10:26.640 --> 00:10:32.940 So here after rotation we have the first factor strongly associated with all the 00:10:32.940 --> 00:10:41.160 running sports. So we have 0.84 here, 0.7, 0.6 and so on. And then the second factor is 00:10:41.160 --> 00:10:47.670 strongly associated with sports that require strength like the discus and the shot put. 00:10:47.670 --> 00:10:55.740 We can see that a bit better by reordering these indicators. So we reorder based on the first 00:10:55.740 --> 00:11:04.050 factor and we can see that the running sports are all among the five largest loadings. Then we have the 00:11:04.050 --> 00:11:10.680 pole vault and then we have the strength sports here: the shot put, javelin and discus throw. 00:11:10.680 --> 00:11:15.990 The first factor now has a clear interpretation. It is related to 00:11:15.990 --> 00:11:22.140 running. So that's the running skills or how good a runner you are. And the second factor has 00:11:22.140 --> 00:11:27.840 a clear interpretation: it's related to these strength sports and it's upper body strength. 00:11:27.840 --> 00:11:35.340 The pole vault requires both so it loads on both. This is called a cross loading because it loads on 00:11:35.340 --> 00:11:41.370 two factors. First you have to run and then you put the pole into the hole and then you have to 00:11:41.370 --> 00:11:47.790 use the upper body to use the pole and get as high as possible. So pole vault requires both skills.
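To make the idea of rotation concrete, here is a small sketch with made-up loadings for four events (not the values from the video). It assumes an orthogonal rotation by a hand-picked 45 degree angle. The video does not say which rotation method was used, and real software chooses the rotation by optimizing a simplicity criterion rather than taking an angle as input. The unrotated pattern mimics the one described above: everything loads on a general first factor, and the second factor contrasts running with strength.

```python
from math import cos, radians, sin

# Hypothetical unrotated loadings: (general factor, strength-vs-running contrast).
unrotated = {
    "100 m run": (0.60, -0.50),
    "400 m run": (0.55, -0.45),
    "shot put": (0.60, 0.50),
    "discus": (0.55, 0.45),
}


def rotate(loadings, degrees):
    """Apply an orthogonal rotation by the given angle to a pair of loadings."""
    angle = radians(degrees)
    c, s = cos(angle), sin(angle)
    first, second = loadings
    return (first * c - second * s, first * s + second * c)


rotated = {event: rotate(pair, 45) for event, pair in unrotated.items()}

for event, (f1, f2) in rotated.items():
    # The communality (sum of squared loadings) is unchanged by the rotation.
    h2 = f1 ** 2 + f2 ** 2
    print(f"{event:10s} factor1 = {f1:5.2f}  factor2 = {f2:5.2f}  communality = {h2:.3f}")
```

After the rotation the running events load almost only on the first factor and the throwing events almost only on the second, while each event's communality stays the same.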
00:11:47.790 --> 00:11:55.050 We can see here also that high jump has a high uniqueness. So it's not really 00:11:55.050 --> 00:11:58.980 related to upper body strength at all and it's not really related to running 00:11:58.980 --> 00:12:03.900 speed because you don't have to run fast, you just run to pace yourself 00:12:03.900 --> 00:12:08.010 and then you jump up. So jumping up is different from running fast. 00:12:08.010 --> 00:12:14.310 In long jump, the better you are at running the faster you can get yourself going 00:12:14.310 --> 00:12:20.430 and the further you will fly when you jump. So that requires running. 00:12:20.430 --> 00:12:24.420 And this way we can give meaning to these factors. 00:12:24.420 --> 00:12:29.910 So that was a two-factor solution. We can of course get more than two factors. So 00:12:29.910 --> 00:12:37.830 there's quite a lot of unexplained variation here. So for high jump 90 percent of the variation is 00:12:37.830 --> 00:12:41.400 unexplained by these two factors. So we can try extracting more factors. 00:12:41.400 --> 00:12:47.400 And whether it makes sense to do so is related more to what your 00:12:47.400 --> 00:12:52.800 theoretical expectation is and whether you can actually interpret the factors, rather than 00:12:52.800 --> 00:12:57.390 to a statistical question of whether we can explain more variation between the indicators. 00:12:57.390 --> 00:13:03.600 There are statistical techniques to decide the number of factors but it is a theoretical 00:13:03.600 --> 00:13:07.320 concern and it's about whether you can interpret the result anymore. 00:13:07.320 --> 00:13:09.660 Let's try three factors and see what happens. 00:13:09.660 --> 00:13:16.410 So that's the rotated solution and I have ordered the variables again according to 00:13:16.410 --> 00:13:20.550 the first factor loading and then the second factor loading. So we have three factors now.
00:13:20.550 --> 00:13:25.980 The first factor is the same running speed, then the second factor is the same upper-body 00:13:25.980 --> 00:13:33.060 strength. So we have the strength sports here and then we have a third factor that 00:13:33.060 --> 00:13:42.060 has the 1500 meter run and the 400 meter run and the long jump and not much else. So it's 00:13:42.060 --> 00:13:48.300 not about running speed as much as it's about running stamina. So it's slightly different. 00:13:48.300 --> 00:13:51.750 So this is whether you're good at running short distances, that's 00:13:51.750 --> 00:13:56.340 explosive running speed and how fast you accelerate, things like that. And this 00:13:56.340 --> 00:14:01.680 is whether you can keep up the running. And the upper-body strength is the same. 00:14:01.680 --> 00:14:08.040 So we can divide running further into two subdimensions. Whether it makes sense to do so is 00:14:08.040 --> 00:14:14.100 another question. In this case probably not. Probably it's better to just say that some 00:14:14.100 --> 00:14:17.520 people are better at strength sports and some people are better at running sports. 00:14:17.520 --> 00:14:25.470 We can also get four factors. We get the same factors: running speed, upper body strength, 00:14:25.470 --> 00:14:34.320 running stamina, and then the final factor is simply high jump. So that 00:14:34.320 --> 00:14:37.530 receives its own factor and nothing else loads on the high jump factor. 00:14:37.530 --> 00:14:42.990 So when we start extracting factors typically we can go 00:14:42.990 --> 00:14:47.250 and get as many factors as we have indicators and eventually we will 00:14:47.250 --> 00:14:51.000 get these factors that just explain a single indicator and nothing more.
00:14:51.000 --> 00:14:56.130 So the idea of a factor is to try to find underlying dimensions from the data and once 00:14:56.130 --> 00:15:02.490 we start to get these factors that just tell how good the guy is in high jump, then 00:15:02.490 --> 00:15:07.380 it's not really a factor anymore in the sense that it's an underlying dimension. 00:15:07.380 --> 00:15:13.530 So probably with this data three factors, if we're really interested in the running 00:15:13.530 --> 00:15:18.360 stamina and running speed difference, could be a good solution or we could just take the two 00:15:18.360 --> 00:15:23.730 factor solution which measures the running skills and the strength of the athlete. 00:15:23.730 --> 00:15:31.020 So it's an argument. The choice of factors depends on what your research question is 00:15:31.020 --> 00:15:33.630 and what kind of abstraction you want to have for your data. 00:15:33.630 --> 00:15:42.270 In practice when we apply factor analysis to measurement scales, for example surveys, and 00:15:42.270 --> 00:15:47.370 we want to measure five different things with the survey, then we set the number of factors to five 00:15:47.370 --> 00:15:54.240 because we want to get five things from the data and ideally the factor analysis demonstrates that 00:15:54.240 --> 00:16:00.570 the indicators correspond to the theoretical constructs that they're supposed to measure. 00:16:00.570 --> 00:16:05.430 Factor analysis is based on correlations. So it's 00:16:05.430 --> 00:16:11.010 useful to understand the relation between the correlation matrix and factor analysis. 00:16:12.060 --> 00:16:17.580 The model-implied correlations: the same principle applies here 00:16:17.580 --> 00:16:22.110 as in the regression model; I'll cover that a bit later. But here we can 00:16:22.110 --> 00:16:26.430 see that factor analysis groups the indicators based on the correlations.
00:16:26.430 --> 00:16:30.840 So we have here first the running speed factor. All the running sports are highly 00:16:30.840 --> 00:16:36.630 correlated. So they are reflections of one underlying running speed factor. 00:16:36.630 --> 00:16:47.250 Then we have these others. We have the upper body strength. So those sports that require upper body 00:16:47.250 --> 00:16:52.740 strength are highly correlated. Then we have the running stamina factor. So some of the running 00:16:52.740 --> 00:17:01.830 sports require both endurance and speed. And then the 1500 meter run requires endurance more than speed. 00:17:01.830 --> 00:17:09.000 And then we have high jump which is not loading on any factors because it is really 00:17:09.000 --> 00:17:14.520 uncorrelated with any other sport. High jump is a unique sport in that it doesn't really 00:17:14.520 --> 00:17:19.710 require strength and it doesn't require speed. It requires the capability to just jump very high.
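The link between loadings and correlations can be sketched like this, assuming uncorrelated factors: the correlation that the model implies between two different indicators is the sum of the products of their loadings, factor by factor. The loadings are made-up illustrative values, not the results shown in the video.

```python
# Hypothetical rotated loadings: (running factor, strength factor).
loadings = {
    "100 m run": (0.8, 0.1),
    "400 m run": (0.7, 0.1),
    "shot put": (0.1, 0.8),
}


def implied_correlation(a, b):
    """Model-implied correlation between two different indicators."""
    return sum(x * y for x, y in zip(loadings[a], loadings[b]))


r_run_run = implied_correlation("100 m run", "400 m run")
r_run_shot = implied_correlation("100 m run", "shot put")
print(f"100 m vs 400 m:    {r_run_run:.2f}")
print(f"100 m vs shot put: {r_run_shot:.2f}")
```

Two running events that share a factor get a high implied correlation, and a running event paired with a throwing event gets a low one, which is the grouping described above.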