Reliability and validity are two important characteristics of good measurement.

Reliability is fairly straightforward to define and to evaluate, because it is simply whether you get the same result over and over if you repeat the same measurement. You can then use that consistency across repeated measures to calculate an estimate of reliability. So that is fairly straightforward. The issue of validity is much more complicated.

Validity refers to whether your indicators measure what they are supposed to measure. The problem is that, because we cannot observe the thing being measured directly, we cannot really statistically assess whether the indicators correspond to the attribute, trait, or construct that we want to measure. So validity and validation are complicated topics, and in this video I will introduce you to some of that complexity.

One thing that makes the validity literature difficult for someone who has just started reading it is that there are so many different terms. Measurement validity is whether an indicator measures what it is supposed to measure. That is fairly straightforward to define; what exactly it means is where the complications start. But then there is all this other terminology: face validity, content validity, convergent validity, discriminant validity, nomological validity, and so on. Do you have to understand all of these? Are they facets of validity that all have to apply? Are they different definitions? Are they contradictory? And so on.

One way to start making sense of this literature is to understand the difference between validity and validation. Validity refers to whether the indicator measures what it is supposed to measure. Validation refers to the different ways in which we can argue or assess validity. Most of these concepts are really about validation. Denny Borsboom's article in Psychological Review notes that these terms originated from particular ways of validating: asking people whether they think the measurement is valid led to the term face validity, and checking whether the measure can predict something useful is predictive validity. So the terminology is about validation more than about validity, and these are two different things: how we argue validity and how we define validity are separate questions.
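Since reliability was defined above as consistency across repeated measurements, here is a minimal sketch of how such an estimate can be computed. The data and numbers are simulated purely for illustration: a test-retest reliability estimate is simply the correlation between two administrations of the same measure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a stable attribute for 200 respondents and two administrations
# of the same test, each contaminated by independent random noise.
true_score = rng.normal(loc=100, scale=15, size=200)
test_1 = true_score + rng.normal(scale=8, size=200)
test_2 = true_score + rng.normal(scale=8, size=200)

# Test-retest reliability estimate: the correlation between the
# two repeated measurements.
reliability = np.corrcoef(test_1, test_2)[0, 1]
print(f"Estimated test-retest reliability: {reliability:.2f}")
```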
If you just look at the definition of validity, then things are much simpler, because you do not have to understand most of this. But there are important terms that you need to understand because they are commonly used, and I will now explain three of them. These originated in psychometric texts from the 1960s, or at least the Nunnally book from the 1960s is commonly cited as a source for these terms and made them popular.

So these three, content validity, predictive validity, and construct validity: are they actually about validity or about validation? Are they competing concepts or complementary concepts? Do you have to demonstrate all of them in your study, or do you focus on one? Let's take a look at what these concepts actually mean, because they are different things.

The idea of content validity is that the indicators in your scale measure all the different aspects or dimensions of the phenomenon. A typical example is a math exam: a math exam has to cover all the content of the course. If you have an elementary school math exam, there are subtractions, multiplications, divisions, and sums that you have to calculate, so four different things. If you only cover subtractions, then you lack content validity. So it is about whether the indicators cover the domain that the test or exam is supposed to summarize. Content validity is mostly a concern in educational measurement, or anywhere you have to summarize people's capabilities or skills in a certain domain with a single score.

Predictive validity is about prediction or forecasting: can you, based on your data, say something about the future? That is not measurement; prediction and measurement are two different things. A typical example is college entrance exams. They are not designed to measure who is good at school or who is smart. They are designed to predict who is going to do well in college and who is going to graduate, because the college is not so much interested in admitting people who are smart or hard-working as in admitting people who are going to graduate.
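Assessing predictive validity is in practice a forecasting check: does the score predict the later outcome? The sketch below is hypothetical, with simulated exam scores, simulated graduation outcomes, and invented numbers, but it shows the basic logic of correlating the score with the outcome it is supposed to forecast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical entrance exam scores and later graduation outcomes (1 = graduated).
exam_score = rng.normal(loc=600, scale=80, size=500)
p_graduate = 1 / (1 + np.exp(-(exam_score - 600) / 60))  # higher score, higher chance
graduated = rng.binomial(1, p_graduate)

# Predictive validity evidence: correlation between the score and the
# future outcome it is meant to forecast (a point-biserial correlation here).
predictive_validity = np.corrcoef(exam_score, graduated)[0, 1]
print(f"Correlation between exam score and graduation: {predictive_validity:.2f}")
```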
Then we have construct validity. This is about concept measurement, but it is a special kind of validation technique. Construct validity is not the definition of measurement validity; it is a validation technique, and why that is the case becomes clear on the next slide.

The idea of construct validity is that there is a nomological network: a network of constructs and their theoretical relationships. The example given by Borsboom and colleagues is that we have intelligence as our focal construct, general knowledge as another construct, and criminal behavior as a third. We have a strong hypothesis that intelligence is negatively associated with criminal behavior and positively associated with general knowledge.

The idea of construct validity, or construct validation, is that we assess or measure intelligence, say with an IQ score, and we check whether the IQ score correlates positively with a general knowledge examination score and negatively with the length of a criminal record. So we have the theoretical world, the nomological network, and the empirical world, our measured correlations, and we check whether the measured correlations from our data match the theoretical expectations. Whatever our measure is, it is construct valid if the relationships between the measured scores correspond to the relationships that we theorize. So basically, the idea of construct validity is whether the empirical correlations are good representations, or proxies, of the theoretical relationships.

This is a somewhat useful way of assessing validity: if your scores do not behave as expected, that is a reason either to doubt the validity of the scores or to doubt the correctness of your theory. But it is also very limited. Consider a very green field of study, where you are studying something that has not been theorized much before: where exactly would you get the nomological network? If you are the first person to introduce a new construct to your field, how exactly are you going to argue that the construct has established relationships with other constructs when there is no existing research on it?

One important thing that construct validity and the other two commonly used validity terms do not address is the relationship between your data and your theoretical concept. Content validity basically just addresses whether the data cover the content of the thing you are studying: does your math test cover everything that was taught during the course? Predictive validity asks whether the score predicts something. So those two are not about theoretical concepts at all.
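Here is a minimal sketch of the correlation-check logic of construct validation described above. The data and the variable names (iq_score, knowledge_score, criminal_record_length) are simulated purely for illustration; the point is only that this form of validation compares the signs of observed correlations with the signs predicted by the nomological network.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Simulated scores standing in for the constructs in the nomological network.
intelligence = rng.normal(size=n)
iq_score = intelligence + rng.normal(scale=0.5, size=n)
knowledge_score = 0.6 * intelligence + rng.normal(scale=0.8, size=n)
criminal_record_length = np.clip(-0.4 * intelligence + rng.normal(scale=0.9, size=n), 0, None)

# Theoretical expectations: IQ correlates positively with general knowledge
# and negatively with the length of the criminal record.
expected_signs = {"knowledge_score": +1, "criminal_record_length": -1}
observed = {
    "knowledge_score": np.corrcoef(iq_score, knowledge_score)[0, 1],
    "criminal_record_length": np.corrcoef(iq_score, criminal_record_length)[0, 1],
}

for name, r in observed.items():
    print(f"{name}: r = {r:+.2f}, matches theory: {np.sign(r) == expected_signs[name]}")
```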
Predictive validity and content validity therefore have no theoretical concepts in their definitions. Construct validity has the term construct in its name, and it does concern the theoretical concept, but it does not address whether the data correspond to the theoretical concept. It only addresses whether the relationships between the variables correspond to the relationships between the theoretical concepts. That is interesting, but it does not really address how the theoretical concepts are related to the data. That question is beyond these terms.

So how do we define validity? One good candidate definition is that a test is valid if the attribute being tested or measured exists. We assume that the construct exists independently of measurement, which is the realist perspective on measurement, and we claim that the variation in the observed data is due to variation in the construct. Say the construct is intelligence: some people are more intelligent than others, and there is variation in IQ scores. We say that the IQ score is a valid measure of intelligence if variation in intelligence causes variation in the scores. In other words, some people perform better on IQ tests because they are more intelligent, and some people perform worse because they are less intelligent. That is the idea: variation in the construct causes variation in the observed data, and the observed data is of course a function of the construct plus some measurement error.

That is an easy definition. What is difficult is to argue that your scores actually are valid. So validation is the hard part; defining validity this way is very simple.
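To make that causal definition concrete, here is a minimal simulated sketch, with intelligence used purely as the assumed example attribute and with made-up numbers: the attribute varies across people, and that variation plus measurement error produces the variation in the observed scores.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# The realist assumption: the attribute exists and varies across people.
intelligence = rng.normal(loc=0, scale=1, size=n)

# The observed score is a function of the construct plus measurement error,
# so variation in the construct causes variation in the score.
iq_score = 100 + 15 * intelligence + rng.normal(scale=5, size=n)

print(f"corr(intelligence, IQ score) = {np.corrcoef(intelligence, iq_score)[0, 1]:.2f}")
```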
So how exactly do you validate, and what do you have to write in your paper to convince your readers that your measures are valid? To understand that, let's compare the latent variable model of validity with construct validity.

The construct validity perspective is more about epistemology: what can we learn from the correlations in our data? Can we use the correlations in our data to learn something about the constructs? That is a useful way of validating, but it does not really address whether the test is valid. The latent variable theory presented on the previous slide is about ontology: does the attribute exist, and does variation in that attribute produce variation in the test score? So the focus is slightly different.

The conceptual focus in construct validity is on the correlations, on what the correlations mean: can we generalize from an observed correlation to a theoretical correlation? In the latent variable model the focus is on reference: do the indicators, the variables, actually refer to any real entity? We have to argue that.

The empirical focus differs as well. In construct validity we check the correlations in our data, and if those correlations match the theoretical expectations, we conclude that the test is valid. In latent variable theory we have to argue causation. Validation is then not a methodological problem but a substantive one: we have to argue why we think our IQ test or innovation score actually varies because the construct being measured varies. Ideally we explain the mechanism of variation: how exactly does a person's intelligence, for example, influence how they do on an IQ test?

This is of course a much more challenging task, and it places more emphasis on validation studies and on the theoretical part of the validation study, whereas construct validation is simply about calculating correlations and seeing whether they match the theoretical expectations. Both are useful: if your measures do not behave as expected, that is a reason to suspect that they may not be valid. But ultimately that is not sufficient to claim validity; you have to look at the causal process.

We can also look at how latent variable theory differs from classical test theory, which gives us the definition of reliability. Classical test theory is a psychometric model, not a measurement theory, so its scope is much narrower: it is a model that describes how people respond to surveys or to different psychological tests. Latent variable theory is a measurement theory, and it takes a realist ontology. Classical test theory does not really say anything about ontology, so it does not say whether the scores measure anything; it only gives us reliability and a true score. Latent variable theory focuses on validity and construct measurement.

The equations for these two models can look similar. Classical test theory is explicitly defined as an equation: the observed score is a deterministic linear combination of the true score plus some random noise.
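As a small illustration of that classical test theory equation, the sketch below simulates a handful of items that are each the same true score plus independent random noise, and then computes Cronbach's alpha from the item variances. The number of items and the noise level are arbitrary assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 4

# Classical test theory: each observed item = true score + random noise.
true_score = rng.normal(size=n)
items = true_score[:, None] + rng.normal(scale=0.7, size=(n, k))

# Cronbach's alpha from the item variances and the variance of the sum score.
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha for {k} items sharing one true score: {alpha:.2f}")
```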
In latent variable theory the claim is more general: it just says that variation in the construct scores causes variation in the observed scores. There is therefore some kind of statistical association between the construct and the measure, but it is not necessarily linear, so we can model other kinds of relationships. The statistical model is taken simply as an approximation of the causal relationship.

The models also differ in how the true score or construct influences the different indicators. In classical test theory we take it as an assumption that the true score influences all indicators equally: if we eliminated all random noise in the data, all the indicators would be exactly the same, because they share the same true score. This is called the tau-equivalence assumption, tau being the Greek letter used for the true score.

In latent variable theory we just say that the various indicators depend on the variation of the construct, but we do not make any explicit claims about how that dependency manifests statistically. Different indicators might depend on the construct in different ways; some may be more sensitive to certain levels of the construct than others. This allows all kinds of statistical models; in particular, IRT or item response theory models are based on this kind of thinking.

Measurement error also differs between the models. In classical test theory it is simply random noise in individual items. In latent variable theory we can allow all kinds of sources of measurement error, but the key thing we have to argue is that the construct is actually a cause of the indicators, or that the variance of the construct is a cause of the variance of the indicators. And that is much more challenging than simply assessing reliability.

Here is one very simple way to use this approach, the latent variable model, to assess reliability and validity. If we take the assumption that linear statistical associations are useful approximations of causal relationships, then we can say that the observed score is a function of the construct score, T here, plus some systematic measurement error plus some random noise. So there are different causal influences on the observed score in this kind of model: the random error relates to reliability, and the systematic error relates to validity.
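A small simulated sketch of this decomposition, with assumed error variances: the observed score is the construct score plus a systematic error component that is shared across repeated administrations plus fresh random noise each time. Reliability, estimated from the consistency of repeated measurements, stays high, while validity, the correspondence with the construct itself, is dragged down by the systematic component.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000

construct = rng.normal(size=n)              # T: the construct score
systematic = rng.normal(scale=0.8, size=n)  # S: systematic (validity) error, e.g. a method bias
noise_1 = rng.normal(scale=0.4, size=n)     # E1: random (reliability) error, administration 1
noise_2 = rng.normal(scale=0.4, size=n)     # E2: random (reliability) error, administration 2

score_1 = construct + systematic + noise_1
score_2 = construct + systematic + noise_2

# Consistency across repeated measurements looks fine (T and S are both shared) ...
print(f"reliability-like corr(score_1, score_2) = {np.corrcoef(score_1, score_2)[0, 1]:.2f}")
# ... but the correspondence with the construct itself is weaker (S and E both hurt it).
print(f"validity-like corr(score_1, construct)  = {np.corrcoef(score_1, construct)[0, 1]:.2f}")
```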
The problem, of course, is that if the random noise and the systematic error are both unique to a single indicator, it may be difficult to know whether that indicator's measurement error is validity error or reliability error. So oftentimes you cannot really say which one it is.

Then a summary of all this. We do not really have any proofs of measurement validity, so validation is more of a substantive argument than a statistical argument. Nevertheless, we can say that if two or more indicators are highly correlated, then they may be measuring the same thing. We just do not know what that thing is, and we have to argue, based on theory, that the construct actually causes a certain kind of behavior in people; that is how we argue validity. It is also possible that the indicators correlate for some other reason. And if a measure behaves as expected with respect to other measures, it may be valid. That is the construct validity way of validating things, and it is a useful technique, but you should not rely on it as your only technique.

Typically with latent variable theory you work with models in which you specify one latent variable as a source of variation for multiple indicators. This is called the common factor model. It is a factor analysis model, and it is commonly used within this kind of validity framework.

This is a very complicated topic. If you want to study validity further, I can recommend two good books. I like the writings of Denny Borsboom. He has written a book called Measuring the Mind, which is an introductory-level book; you can read it after reading, for example, Scale Development, which gives you an overview. Once you have read that, you can look at more challenging texts such as Frontiers of Test Validity Theory by Keith Markus and Borsboom, which summarizes a broad range of the validity literature and is fairly condensed. So it is probably not the best first book, but it is a really great overview of test validity theory.
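To make the common factor model mentioned above concrete, here is a minimal sketch using scikit-learn's FactorAnalysis (one possible tool among several; the loadings, sample size, and noise level are arbitrary assumptions). One latent factor is simulated as the common source of variation in four indicators, and a one-factor model is then fitted to recover the loadings.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
n = 1000

# One latent construct as the common source of variation in four indicators.
factor = rng.normal(size=n)
loadings = np.array([0.9, 0.8, 0.7, 0.6])
indicators = factor[:, None] * loadings + rng.normal(scale=0.5, size=(n, 4))

# Fit a one-factor common factor model and inspect the estimated loadings
# (their overall sign is arbitrary, as usual in factor analysis).
model = FactorAnalysis(n_components=1).fit(indicators)
print("Estimated loadings:", np.round(model.components_.ravel(), 2))
```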