WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:04.200
Factor analysis is a very useful
tool for validating measurement.
00:00:04.200 --> 00:00:09.390
The idea of factor analysis is that
it takes in multiple indicators and
00:00:09.390 --> 00:00:12.960
then answers the question: what
do the indicators have in common?
00:00:12.960 --> 00:00:18.840
So it tries to extract or identify
underlying dimensions from your data.
00:00:18.840 --> 00:00:24.810
The reason why we use factor analysis for
measurement is that before we apply any
00:00:24.810 --> 00:00:28.740
reliability statistics we have
to study if the indicators are
00:00:28.740 --> 00:00:33.690
unidimensional. If so, then we use a
unidimensional reliability index; if
00:00:33.690 --> 00:00:38.820
not, then we calculate the reliability
statistic based on the factor analysis.
00:00:38.820 --> 00:00:45.000
Factor analysis also can be used to assess
the hypothesis that the indicators are
00:00:45.000 --> 00:00:51.180
consequences of a common cause, and in that
way we can try to use the factor
00:00:51.180 --> 00:00:57.720
analysis to justify causal claims where we
say that the construct causes multiple items.
00:00:57.720 --> 00:01:02.430
Factor analysis techniques
come in two main variants:
00:01:02.430 --> 00:01:06.060
exploratory factor analysis and
confirmatory factor analysis.
00:01:06.060 --> 00:01:13.890
In exploratory factor analysis - it's an
exploratory process - you give the computer
00:01:13.890 --> 00:01:20.820
your dataset and then you ask the computer to
give you three factors, two factors, or however many
00:01:20.820 --> 00:01:26.070
factors you want to have from the data,
and then the computer will identify the factors.
00:01:26.070 --> 00:01:33.000
In confirmatory factor analysis you specify
the factor structure yourself. So you say
00:01:33.000 --> 00:01:37.500
the first three indicators for example
measure one thing that is one factor
00:01:37.500 --> 00:01:42.570
then the second three measure another thing
that's a factor and then the remaining four
00:01:42.570 --> 00:01:47.190
indicators measure a third thing and that's
the third factor, and then the computer
00:01:47.190 --> 00:01:53.490
will estimate the model for you and tell you
if that model is plausible for the data.
00:01:53.490 --> 00:01:59.730
Exploratory factor analysis is easy to apply
because you don't have to specify the structure
00:01:59.730 --> 00:02:04.530
yourself - you just specify the number of
factors and which variables you use and
00:02:04.530 --> 00:02:11.310
for that reason many people get started with the
exploratory factor analysis instead and if you do
00:02:11.310 --> 00:02:17.170
data exploration or some initial analysis then
exploratory factor analysis is quicker to do.
00:02:17.170 --> 00:02:21.100
So exploratory factor analysis is the one that is
00:02:21.100 --> 00:02:24.550
typically covered first followed
by confirmatory factor analysis.
00:02:24.550 --> 00:02:30.760
I will demonstrate factor analysis using
the exploratory approach and to do that we
00:02:30.760 --> 00:02:37.330
need some data, and our data are from the Olympic
decathlon. So we have the ten sports that the
00:02:37.330 --> 00:02:46.270
athletes do: the 100 meter run, long
jump, shot put, high jump, 400 meter run,
00:02:46.270 --> 00:02:54.130
110 meter hurdles, discus throw, pole
vault, javelin throw and 1500 meters run.
00:02:54.130 --> 00:03:01.420
So there are 10 different sports that you do
in this competition and then you are rated
00:03:01.420 --> 00:03:07.060
based on your performance in all of them. And the
overall ranking is determined by this score.
00:03:07.060 --> 00:03:11.380
So you have to be a very good overall
athlete to be able to do decathlon.
00:03:11.380 --> 00:03:20.410
So the data looks like this. So that's the first
15 observations. The 100 meters is in seconds,
00:03:20.410 --> 00:03:28.780
long jump how many meters, shot put how
many meters, high jump how many meters,
00:03:28.780 --> 00:03:33.940
400 meters run how many seconds,
110 meter hurdles how many seconds,
00:03:33.940 --> 00:03:41.440
discus throw how far in meters you threw it, pole
vault how high in meters, javelin how many
00:03:41.440 --> 00:03:48.100
meters you threw the javelin and then how many
seconds was the one and a half kilometer run.
00:03:48.100 --> 00:03:55.360
So what kind of dimensions does this data have?
That's what factor analysis will tell us. And
00:03:55.360 --> 00:04:01.990
we'll first do a factor analysis and we'll request
two factors just to get started with something.
00:04:01.990 --> 00:04:08.920
So that's the two factor solution and before I
explain the factors it's important to understand
00:04:08.920 --> 00:04:16.480
what these numbers tell us. And let's
start with uniqueness and communality.
00:04:16.480 --> 00:04:25.750
So uniqueness and communality sum to
100% or 1. Communality
00:04:25.750 --> 00:04:31.990
tells how much of the variation of this
particular indicator the two factors explain.
00:04:31.990 --> 00:04:39.850
So for example, for shot put the
factors explain ninety-four and a
00:04:39.850 --> 00:04:45.430
half percent of the variation and
only 0.5% is unexplained. So the
00:04:45.430 --> 00:04:50.590
uniqueness is how much of the indicator
remains unexplained by the factors.
00:04:50.590 --> 00:04:57.340
Ideally if the factor model is correctly
specified - so that the factors perfectly
00:04:57.340 --> 00:05:04.420
match your theoretical constructs and the
indicators have no systematic measurement
00:05:04.420 --> 00:05:10.300
errors - then this uniqueness here quantifies
the amount of random noise in the indicators.
00:05:10.300 --> 00:05:16.750
That's an ideal case. Whether that applies
in any real case that's another question.
00:05:17.620 --> 00:05:23.290
So the communality is a kind of
measure of reliability and the uniqueness is
00:05:23.290 --> 00:05:26.650
an estimate of unreliability. So that's one way.
00:05:26.650 --> 00:05:34.780
Then we have two factors. We have MR1
and MR2. The MR simply comes from the fact
00:05:34.780 --> 00:05:38.530
that we estimated with the minres technique; you
don't have to care about what that means.
00:05:38.530 --> 00:05:47.620
So we have a first factor and second factor.
These are called factor loadings. And they
00:05:47.620 --> 00:05:53.590
are in correlation metric here. So the
idea here is that the first indicator
00:05:53.590 --> 00:05:59.800
correlates at minus 0.71 with the first
factor and minus 0.22 with the second
00:05:59.800 --> 00:06:04.870
factor. So the first indicator - first
variable is very strongly associated
00:06:04.870 --> 00:06:10.330
with the first factor and then a bit more
weakly associated with the second factor.
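A minimal sketch of how communality and uniqueness relate to the loadings, assuming the two factors are uncorrelated, which the talk does not state explicitly. Under that assumption the communality of a standardized indicator is the sum of its squared loadings, and the uniqueness is one minus that. The loadings are the ones quoted above for the first indicator; the function name is only illustrative.

```python
def communality(loadings):
    """Share of an indicator's variance explained by uncorrelated factors."""
    return sum(value ** 2 for value in loadings)

# Loadings of the first indicator on the two factors, as quoted above.
first_indicator = [-0.71, -0.22]

h2 = communality(first_indicator)  # explained share (communality)
u2 = 1.0 - h2                      # unexplained share (uniqueness)
print(f"communality = {h2:.4f}, uniqueness = {u2:.4f}")
```

The sign of a loading does not matter here, because the loadings are squared before they are summed.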
00:06:10.330 --> 00:06:13.630
So let's just take a look at the first factor now.
00:06:13.630 --> 00:06:19.060
For the first factor, we first
notice that some of the indicators
00:06:19.060 --> 00:06:23.140
have negative factor loadings. We have
to understand why that is the case.
00:06:23.140 --> 00:06:31.600
If we start to look at those items that have
negative loadings - we have the 100 meter run,
00:06:31.600 --> 00:06:38.530
we have the 400 meter run, we have the
110 meter hurdles and then we have the
00:06:38.530 --> 00:06:45.100
1500 meter run. So all these are running
sports and what they have in common is that
00:06:45.100 --> 00:06:50.470
more time means that you're worse.
Less time means you're better.
00:06:50.470 --> 00:06:54.730
With all these others you are throwing something
00:06:54.730 --> 00:07:00.520
or you're jumping, and more
is better. So in these running sports
00:07:00.520 --> 00:07:05.770
less time is better - in these others
more distance more height is better.
00:07:05.770 --> 00:07:11.410
To make the results a bit more understandable I
00:07:11.410 --> 00:07:16.870
will therefore now reverse score the
times, so that for all variables
00:07:16.870 --> 00:07:21.220
more of the variable indicates that the
athlete performs better.
00:07:21.220 --> 00:07:26.050
So I will reverse the signs of
all of these running sports and
00:07:26.050 --> 00:07:28.150
then we have this kind of factor analysis result.
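Reverse scoring the times can be sketched as follows. The column names and the athlete's values are made up for illustration; only the idea of flipping the sign of the timed events comes from the talk.

```python
# Hypothetical column names for the four timed running events.
TIMED_EVENTS = {"run100m", "run400m", "hurdles110m", "run1500m"}

def reverse_score(row, timed=TIMED_EVENTS):
    """Negate timed events so that a larger value always means a better result."""
    return {name: (-value if name in timed else value)
            for name, value in row.items()}

# Made-up results for one athlete (seconds for runs, meters otherwise).
athlete = {"run100m": 11.2, "long_jump": 7.4, "shot_put": 15.0, "run1500m": 270.0}
scored = reverse_score(athlete)
print(scored)
```

After this step a faster runner has a larger (less negative) value, so the running events point in the same direction as the throws and jumps.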
00:07:28.150 --> 00:07:32.830
We can see that every
indicator here loads
00:07:32.830 --> 00:07:38.320
positively on the first factor and the
magnitudes of the factor loadings differ.
00:07:38.320 --> 00:07:46.120
So how would we interpret the first factor?
All indicators are positively associated with
00:07:46.120 --> 00:07:52.750
something. What's the thing? We have to interpret
what is the underlying dimension that
00:07:53.590 --> 00:07:58.990
influences these indicators
or variables according to these results.
00:07:58.990 --> 00:08:07.180
If everything correlates
positively with the first factor, then the first
00:08:07.180 --> 00:08:11.980
factor basically is how good the guy is,
so how good of an athlete the person
00:08:11.980 --> 00:08:19.210
is. If you are a good athlete then you
perform better in all of these sports.
00:08:19.210 --> 00:08:23.410
So good athletes are expected to
perform better than bad athletes.
00:08:23.410 --> 00:08:26.200
Therefore all the items are positively correlated.
00:08:26.200 --> 00:08:38.170
For the second factor here we
can see that shot put and javelin
00:08:38.170 --> 00:08:44.770
are positively associated. The 1500
meters is negatively associated, as are all the
00:08:44.770 --> 00:08:52.060
other running sports. So the second factor
quantifies whether the person is better
00:08:52.060 --> 00:08:58.380
at sports that require strength versus
the sports that require running speed.
00:08:58.380 --> 00:09:04.110
So there is a trade-off: if you are a very
bulky guy, you're good in these strength
00:09:04.110 --> 00:09:09.300
sports but you have more mass, therefore
you're not that great in the running
00:09:09.300 --> 00:09:14.010
sports. So there's a trade-off and this
second factor quantifies that trade-off.
00:09:14.010 --> 00:09:18.840
So we have a factor how good a guy
is and we have a factor of whether
00:09:18.840 --> 00:09:21.390
the guy is better at running or strength sports.
00:09:21.390 --> 00:09:28.020
We would ideally like to
think that there are two dimensions to
00:09:28.020 --> 00:09:34.710
this data. How good the guy is in running
and how good the guy is in these sports that
00:09:34.710 --> 00:09:39.360
require strength. But this factor analysis
solution doesn't answer that question.
00:09:39.360 --> 00:09:44.340
To answer that question we do something
called factor rotation. So the factor
00:09:44.340 --> 00:09:51.360
rotation is a technique that reorients the factor
solution so that it's simpler to interpret.
00:09:51.360 --> 00:09:57.990
Typically when you apply a factor analysis
and you have two correlated dimensions then
00:09:57.990 --> 00:10:02.430
the first factor will capture a little
bit of both dimensions. Like we have
00:10:02.430 --> 00:10:07.320
running speed and strength captured
by the factor for how good the guy is, and
00:10:07.320 --> 00:10:11.940
the second factor will then capture
whether the guy is better at running
00:10:11.940 --> 00:10:18.330
or at strength sports. When we reorient the
factor analysis using factor rotation then
00:10:18.330 --> 00:10:26.640
the factors will typically correspond
better to actual dimensions in the data.
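A minimal sketch of what a rotation does in the two-factor case. The talk does not say which rotation method was used, so this assumes an orthogonal rotation chosen by the raw varimax criterion, found here with a simple grid search over the angle. The unrotated loadings are made up: two "running" items and two "strength" items that all load on a general factor and split on the second one.

```python
import math

def rotate(loadings, theta):
    """Rotate two-factor loadings by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / len(squared)
        total += sum((v - mean) ** 2 for v in squared) / len(squared)
    return total

def rotate_to_simple_structure(loadings, steps=900):
    """Grid-search the angle in [0, 90) degrees that maximizes the criterion."""
    angles = [k * (math.pi / 2) / steps for k in range(steps)]
    best = max(angles, key=lambda t: varimax_criterion(rotate(loadings, t)))
    rotated = rotate(loadings, best)
    # Reflect a factor whose loadings sum to a negative value.
    signs = [1 if sum(row[j] for row in rotated) >= 0 else -1 for j in range(2)]
    return [(row[0] * signs[0], row[1] * signs[1]) for row in rotated]

# Hypothetical unrotated loadings: (general factor, running-vs-strength contrast).
unrotated = [(0.6, -0.5), (0.7, -0.4), (0.6, 0.5), (0.7, 0.4)]
rotated = rotate_to_simple_structure(unrotated)
for before, after in zip(unrotated, rotated):
    print(before, "->", tuple(round(v, 2) for v in after))
```

The rotation only reorients the axes: each item's communality (the sum of its squared loadings) stays the same, while the loadings are redistributed so that each item loads mainly on one factor.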
00:10:26.640 --> 00:10:32.940
So here after rotation we have the first
factor strongly associated with all the
00:10:32.940 --> 00:10:41.160
running sports. So we have 0.84 here 0.7,
0.6 and so on. And then the second factor is
00:10:41.160 --> 00:10:47.670
strongly associated with sports that require
strength like the discus and the shot put.
00:10:47.670 --> 00:10:55.740
We can see that a bit better by reordering
these indicators. So we reorder based on the first
00:10:55.740 --> 00:11:04.050
factor and we can see that the running sports are
all among the five largest loadings. Then we have the
00:11:04.050 --> 00:11:10.680
pole vault and then we have the strength sports
here: the shot put, javelin and discus throw.
00:11:10.680 --> 00:11:15.990
The first factor now has a clear
interpretation. It is related to
00:11:15.990 --> 00:11:22.140
running. So that's the running skills or how
good a runner you are. And the second factor also has
00:11:22.140 --> 00:11:27.840
a clear interpretation - it's related to these
strength sports and it's upper body strength.
00:11:27.840 --> 00:11:35.340
The pole vault requires both so it loads on both.
This is called a cross loading because it loads on
00:11:35.340 --> 00:11:41.370
two factors. First you have to run and then you
put the pole into the hole and then you have to
00:11:41.370 --> 00:11:47.790
use the upper body to use the pole and get as high
as possible. So pole vault requires both skills.
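Spotting cross loadings in a loading table can be sketched like this. The cutoff of 0.3 and the loadings are assumptions made for illustration; they are not the values from the slide.

```python
def cross_loading_items(loadings, cutoff=0.3):
    """Return items whose absolute loading reaches the cutoff on more than one factor."""
    return [name for name, row in loadings.items()
            if sum(abs(value) >= cutoff for value in row) > 1]

# Hypothetical rotated loadings: (running factor, strength factor).
example_loadings = {
    "run100m": (0.84, 0.10),
    "shot_put": (0.05, 0.90),
    "pole_vault": (0.50, 0.45),
}
flagged = cross_loading_items(example_loadings)
print(flagged)
```

With these made-up numbers only the pole vault is flagged, which matches the interpretation that it draws on both running and upper body strength.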
00:11:47.790 --> 00:11:55.050
We can see here also that high jump has
a high uniqueness. So it's not really
00:11:55.050 --> 00:11:58.980
related to upper body strength at all
and it's not really related to running
00:11:58.980 --> 00:12:03.900
speed because you don't have to run
fast; you just run to pace yourself
00:12:03.900 --> 00:12:08.010
and then you jump up. So jumping
up is different from running fast.
00:12:08.010 --> 00:12:14.310
In long jump, the better you are at
running the faster you can get yourself going
00:12:14.310 --> 00:12:20.430
and the further you will
fly when you jump. So that requires running.
00:12:20.430 --> 00:12:24.420
And this way we can interpret the
meaning - give meaning to these factors.
00:12:24.420 --> 00:12:29.910
So that was a two-factor solution. We can
of course get more than two factors. So
00:12:29.910 --> 00:12:37.830
there's quite a lot of unexplained variation
here. For high jump, 90 percent of the variation is
00:12:37.830 --> 00:12:41.400
unexplained by these two factors. So
we can try extracting more factors.
00:12:41.400 --> 00:12:47.400
And whether it makes sense to do
so is more a matter of what your
00:12:47.400 --> 00:12:52.800
theoretical expectation is and whether you can
actually interpret the factors than
00:12:52.800 --> 00:12:57.390
a statistical question of whether we can
explain more variation between the indicators.
00:12:57.390 --> 00:13:03.600
There are statistical techniques to decide
the number of factors but it is a theoretical
00:13:03.600 --> 00:13:07.320
concern and it's about whether you
can interpret the result anymore.
00:13:07.320 --> 00:13:09.660
Let's try three factors and see what happens.
00:13:09.660 --> 00:13:16.410
So that's the rotated solution and I have
ordered the variables again according to
00:13:16.410 --> 00:13:20.550
the first factor loading and then the second
factor loading. So we have three factors now.
00:13:20.550 --> 00:13:25.980
The first factor is the same running speed
then the second factor is the same upper-body
00:13:25.980 --> 00:13:33.060
strength. So we have the strength sports
here and then we have a third factor that
00:13:33.060 --> 00:13:42.060
has the 1500 meter run and the 400 meter run
and the long jump and not much else. So it's
00:13:42.060 --> 00:13:48.300
not about running speed as much as it's about
running stamina. So it's slightly different.
00:13:48.300 --> 00:13:51.750
So this is whether you're good
at running short distances that's
00:13:51.750 --> 00:13:56.340
explosive running speed and how fast you
accelerate things like that. And this
00:13:56.340 --> 00:14:01.680
is whether you can keep up the running.
And the upper-body strength is the same.
00:14:01.680 --> 00:14:08.040
So we can divide running further into two
subdimensions. Whether it makes sense to do so is
00:14:08.040 --> 00:14:14.100
another question. In this case probably not.
Probably it's better to just say that some
00:14:14.100 --> 00:14:17.520
people are better at strength sports and
some people are better at running sports.
00:14:17.520 --> 00:14:25.470
We can also get four factors. We get the same
factors: running speed, upper body strength,
00:14:25.470 --> 00:14:34.320
running stamina and then the final
factor is simply high jump. So that
00:14:34.320 --> 00:14:37.530
receives its own factor and nothing
else loads on the high jump factor.
00:14:37.530 --> 00:14:42.990
So when we start extracting
factors typically we can go
00:14:42.990 --> 00:14:47.250
and get as many factors as we have
indicators and eventually we will
00:14:47.250 --> 00:14:51.000
get these factors that just explain
a single indicator and nothing more.
00:14:51.000 --> 00:14:56.130
So the idea of factor analysis is to try to find
underlying dimensions from the data, and once
00:14:56.130 --> 00:15:02.490
we start to get these factors that just tell
how good the guy is in high jump - then
00:15:02.490 --> 00:15:07.380
it's not really a factor anymore in the
sense that it's an underlying dimension.
00:15:07.380 --> 00:15:13.530
So probably with this data three factors -
if we're really interested in the running
00:15:13.530 --> 00:15:18.360
stamina and running speed difference - could be
a good solution or we could just take the two
00:15:18.360 --> 00:15:23.730
factor solution which measures the running
skills and the strength of the athlete.
00:15:23.730 --> 00:15:31.020
So it's an argument. The choice of factors
depends on what your research question is
00:15:31.020 --> 00:15:33.630
and what kind of abstraction
you want to have for your data.
00:15:33.630 --> 00:15:42.270
In practice when we apply factor analysis to
measurement scales - for example surveys - and
00:15:42.270 --> 00:15:47.370
we want to measure five different things with the
survey, then we set the number of factors to five
00:15:47.370 --> 00:15:54.240
because we want to get five things from the data
and ideally the factor analysis demonstrates that
00:15:54.240 --> 00:16:00.570
the indicators correspond to the theoretical
constructs that they're supposed to measure.
00:16:00.570 --> 00:16:05.430
Factor analysis is based on
correlations. So it's
00:16:05.430 --> 00:16:11.010
useful to understand the relation between
the correlation matrix and factor analysis.
00:16:12.060 --> 00:16:17.580
The model-implied correlations -
the same principle applies here
00:16:17.580 --> 00:16:22.110
as in a regression model - I'll cover
that a bit later. But here we can
00:16:22.110 --> 00:16:26.430
see that factor analysis groups the
indicators based on the correlations.
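A minimal sketch of the link between loadings and correlations, assuming uncorrelated factors: the correlation that the model implies between two different indicators is the sum, over factors, of the products of their loadings. The loadings below are made up for illustration.

```python
def implied_correlation(loadings_i, loadings_j):
    """Model-implied correlation between two different indicators (uncorrelated factors)."""
    return sum(a * b for a, b in zip(loadings_i, loadings_j))

# Hypothetical loadings: (running factor, strength factor).
example = {
    "run100m": (0.8, 0.1),
    "run400m": (0.7, 0.0),
    "shot_put": (0.1, 0.9),
}
r_running = implied_correlation(example["run100m"], example["run400m"])
r_mixed = implied_correlation(example["run100m"], example["shot_put"])
print(round(r_running, 2), round(r_mixed, 2))
```

Two indicators that load on the same factor get a high implied correlation, while indicators that load on different factors get a low one, which is why highly correlated indicators end up grouped on the same factor.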
00:16:26.430 --> 00:16:30.840
So we have here first the running speed
factor. All the running sports are highly
00:16:30.840 --> 00:16:36.630
correlated. So they are reflections
of one underlying running speed factor.
00:16:36.630 --> 00:16:47.250
Then we have these others. We have the upper body
strength. So those sports that require upper body
00:16:47.250 --> 00:16:52.740
strength are highly correlated. Then we have the
running stamina factor. So some of the running
00:16:52.740 --> 00:17:01.830
sports require both endurance and speed. And
then the 1500 meter run requires endurance more than speed.
00:17:01.830 --> 00:17:09.000
And then we have high jump which is not loading
on any factor because it is really
00:17:09.000 --> 00:17:14.520
uncorrelated with any other sport. High jump
is a unique sport in that it doesn't really
00:17:14.520 --> 00:17:19.710
require strength and it doesn't require speed. It
requires the capability to just jump very high.