WEBVTT Kind: captions Language: en
00:00:00.060 --> 00:00:04.320 It is important that our study results are sufficiently reliable.
00:00:04.320 --> 00:00:10.650 It is also important to be able to argue reliability based on empirical data. So
00:00:10.650 --> 00:00:18.060 how exactly do we assess reliability, and how exactly do we argue that our results are reliable?
00:00:18.060 --> 00:00:23.910 Before we go into that, there is one thing that I want to address: so-called rules of thumb.
00:00:23.910 --> 00:00:30.600 Particularly with respect to reliability and measurement validity, there is this tendency for
00:00:30.600 --> 00:00:36.930 researchers to think that if you have a statistic that exceeds a particular threshold, then everything
00:00:36.930 --> 00:00:43.260 is okay, and if the same statistic falls just below the threshold, then the study is worthless.
00:00:43.260 --> 00:00:50.970 This kind of yes-or-no thinking is not ideal, and in particular you cannot really
00:00:50.970 --> 00:00:56.100 justify that kind of yes-or-no decision based on any good methodological resource.
00:00:56.100 --> 00:01:01.800 Many authors tend to cite Nunnally's book on psychometric theory for the rule
00:01:01.800 --> 00:01:08.460 of thumb of 0.7 for Cronbach's alpha, which is coefficient alpha, a reliability statistic.
00:01:08.460 --> 00:01:13.860 Well, the problem is that he doesn't make that kind of claim in his book. Instead,
00:01:13.860 --> 00:01:19.620 reliability is something that you have to take into consideration. If your study measures
00:01:19.620 --> 00:01:25.230 are eighty percent reliable, sometimes that is enough; sometimes eighty percent is not enough.
00:01:25.230 --> 00:01:31.200 You have to explain to your reader what it means. What kind of bias do you expect if you
00:01:31.200 --> 00:01:36.540 have 70 percent reliability? What kind of bias do you expect if you have 95 percent reliability?
00:01:36.540 --> 00:01:43.650 Is it a problem or not? It is not a matter of exceeding a certain cutoff; instead, it is a matter
00:01:43.650 --> 00:01:49.980 of understanding what reliability means for your results and then explaining that to your readers.
00:01:49.980 --> 00:01:55.440 Before we talk about the actual statistics,
00:01:55.440 --> 00:02:00.150 it is important to understand what kind of assumptions the reliability
00:02:00.150 --> 00:02:04.770 statistics are based on and what the principle of assessing reliability is.
00:02:04.770 --> 00:02:08.250 With the bathroom scale example that I used in the
00:02:08.250 --> 00:02:13.410 previous video, it is very simple. You measure the same person again with
00:02:13.410 --> 00:02:18.390 the same scale, and if you get the same result, then your measure is reliable.
00:02:18.390 --> 00:02:23.850 When we measure people or organizations by surveying people, for example, things are a bit
00:02:23.850 --> 00:02:30.660 more complicated. The reason is that if we ask a person whether they like, for example, the United
00:02:30.660 --> 00:02:37.665 Nations, and then we ask the person again whether they like the United Nations, the second answer
00:02:37.665 --> 00:02:43.740 to the question is influenced by the previous answer. So if we ask the person the same question
00:02:43.740 --> 00:02:48.780 over and over, they will give us the same answer, because that's how they answered the last time.
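NOTE This is an added illustration, not part of the lecture: under classical test theory, an observed correlation is attenuated by the square root of the product of the two measures' reliabilities, which is one concrete way to explain to a reader what 70 percent versus 95 percent reliability means for your results. The true correlation of 0.50 below is a made-up assumption.
  # Hypothetical sketch: how reliability attenuates an observed correlation.
  true_r = 0.50  # assumed true correlation between the two constructs
  for rel_x, rel_y in [(0.70, 0.70), (0.95, 0.95)]:
      observed_r = true_r * (rel_x * rel_y) ** 0.5  # attenuation formula
      print(f"reliability {rel_x:.2f}/{rel_y:.2f}: expected observed r = {observed_r:.2f}")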
00:02:48.780 --> 00:02:52.440 So whereas a bathroom scale doesn't remember
00:02:52.440 --> 00:02:56.790 what the previous measurement was, people do, and that's a problem.
00:02:56.790 --> 00:03:03.690 Classical test theory has this concept of parallel tests. The idea of a parallel
00:03:03.690 --> 00:03:10.920 test is a hypothetical scenario where we would measure the same person again without
00:03:10.920 --> 00:03:15.690 that person having any recollection of the previous measurement occasion.
00:03:15.690 --> 00:03:22.530 An example here is that if we ask Mr. Brown whether he likes the United Nations or not, then if
00:03:22.530 --> 00:03:27.270 we ask him the same question again, we would have to brainwash Mr. Brown in between
00:03:27.270 --> 00:03:32.580 those two questions so that they are really independent tests of the same attribute.
00:03:32.580 --> 00:03:40.440 This of course is a counterfactual argument, because we cannot brainwash our subjects. Our
00:03:40.440 --> 00:03:45.900 subjects will know what they answered the last time. So if we ask a survey question and then ask
00:03:45.900 --> 00:03:51.720 the next question, how the person answers the next question will be influenced by how they
00:03:51.720 --> 00:03:55.650 answered the first question. So we simply cannot ask the same question over and over.
00:03:55.650 --> 00:04:00.990 There are two workarounds for this problem that we cannot really do
00:04:00.990 --> 00:04:06.210 these parallel tests, that is, test the same attribute of the same person at the
00:04:06.210 --> 00:04:11.550 same occasion or without the person having any recollection of being tested before.
00:04:11.550 --> 00:04:18.720 The two ways are: we either do actual replications and assume that
00:04:18.720 --> 00:04:25.260 they are parallel. That will work if we have a time delay. For example, if we ask a
00:04:25.260 --> 00:04:30.330 person now whether they like the United Nations and we ask them the same question a week later,
00:04:30.330 --> 00:04:35.370 they may not remember anymore what the original answer was, in which case we
00:04:35.370 --> 00:04:41.310 could argue that those repeated measures mimic the parallel tests scenario.
00:04:41.310 --> 00:04:49.830 The other way is to assume that two distinct measures are parallel. So we measure the same
00:04:49.830 --> 00:04:53.550 attribute in a different way, and we assume that those
00:04:53.550 --> 00:04:58.710 two different ways of measuring the same thing are parallel. So instead of asking
00:04:58.710 --> 00:05:04.290 the person whether he likes the United Nations or not, we could ask him whether he thinks
00:05:04.290 --> 00:05:08.430 that the United Nations is the best thing that has ever happened to mankind, for example.
00:05:08.430 --> 00:05:13.710 So we measure the same thing again, but slightly differently. That way
00:05:13.710 --> 00:05:18.900 we could say that the second measurement is not as much influenced by the
00:05:18.900 --> 00:05:23.220 first measurement as it would be if we just repeated the same question over and over.
00:05:23.220 --> 00:05:30.780 The first approach, repeating the exact same measurement again
00:05:30.780 --> 00:05:38.340 with a time delay, is called test-retest reliability.
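NOTE An added simulation sketch of the parallel tests idea, under my own assumptions rather than the lecturer's numbers: two tests share the same true score and have independent, equal-variance errors, so their correlation recovers the reliability var(T)/var(X).
  import numpy as np
  # Simulate parallel tests of the same attribute with independent errors.
  rng = np.random.default_rng(1)
  n = 100_000
  T = rng.normal(0, 1.0, n)          # true attitude (e.g. liking the United Nations)
  test1 = T + rng.normal(0, 0.5, n)  # first test
  test2 = T + rng.normal(0, 0.5, n)  # hypothetical retest with no memory of test1
  # corr(test1, test2) = var(T) / (var(T) + var(E)) = 1 / 1.25 = 0.80
  print(np.corrcoef(test1, test2)[0, 1])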
00:05:38.340 --> 00:05:43.710 So the idea is that if the attribute that we're measuring is relatively stable over time, then
00:05:43.710 --> 00:05:49.170 if a person answers or tests differently on a different occasion, the only reason
00:05:49.170 --> 00:05:54.870 for the difference between the two tests is unreliability, because the trait is stable.
00:05:54.870 --> 00:06:01.950 We also have to make the assumption that errors are independent, which is justified
00:06:01.950 --> 00:06:06.330 by the time delay: you don't remember what you answered the last time because there's a
00:06:06.330 --> 00:06:14.040 time delay. An example here would be that if we weigh a child that wiggles,
00:06:14.040 --> 00:06:19.770 and the measurements are done in a matter of seconds, the true weight does not change. We cannot
00:06:19.770 --> 00:06:28.740 argue test-retest reliability in that case with, for example, a one-year time delay. So we
00:06:28.740 --> 00:06:34.710 can't weigh a child at five years and again at six years and then say that the difference between
00:06:34.710 --> 00:06:41.490 those two measurements is evidence of unreliability. That would not be valid evidence of
00:06:41.490 --> 00:06:46.290 unreliability, because we cannot assume that the trait is stable over such a long period.
00:06:46.290 --> 00:06:52.770 So you have to consider how quickly the trait, or the thing that is being
00:06:52.770 --> 00:06:59.220 measured, changes over time, and how quickly people reset by forgetting
00:06:59.220 --> 00:07:03.990 that they were tested or how exactly they answered the question in the first place.
00:07:03.990 --> 00:07:05.100 So that's test-retest reliability.
00:07:05.100 --> 00:07:11.370 Let's take a look at an example of test-retest reliability from Yli-Renko's paper. They
00:07:11.370 --> 00:07:18.930 say that they asked a slightly different question again, with a two-year delay, on the
00:07:18.930 --> 00:07:30.000 key construct. The study was about small companies, and with the two-year delay,
00:07:30.000 --> 00:07:34.920 for that to be valid you would have to assume that nothing changes within
00:07:34.920 --> 00:07:44.640 small companies in two years' time. That is of course not a valid assumption.
00:07:44.640 --> 00:07:50.430 So we can't make the assumption here that the trait doesn't change, and this
00:07:50.430 --> 00:07:55.590 would not be a valid test-retest. It would be valid if you did a survey of a business
00:07:55.590 --> 00:08:03.240 organization with, say, a two-week or one-month time delay. Then you could reasonably
00:08:03.240 --> 00:08:07.530 assume that there are no major changes. But if you have a two-year delay, like
00:08:07.530 --> 00:08:14.730 in this paper, then that is not a very good test-retest reliability estimate.
00:08:14.730 --> 00:08:22.770 So in test-retest you measure the same thing again with a time delay that is appropriate for
00:08:22.770 --> 00:08:28.410 your measure and the trait being measured, so that it allows people to reset between
00:08:28.410 --> 00:08:33.690 measurements but the trait doesn't change substantially between the measurements.
00:08:35.130 --> 00:08:40.650 This is not as commonly used because, of course, doing two rounds of a survey
00:08:40.650 --> 00:08:44.580 study is more expensive than doing just one round of a survey
00:08:44.580 --> 00:08:49.170 study.
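NOTE An added sketch of how a test-retest estimate is typically computed: the correlation between the same measure taken at two occasions. The two waves below are simulated under the assumptions the lecture states, a stable trait and errors made independent by the delay.
  import numpy as np
  # Test-retest reliability: correlate wave 1 with wave 2 of the same measure.
  rng = np.random.default_rng(2)
  trait = rng.normal(0, 1.0, 500)          # assumed stable across the delay
  wave1 = trait + rng.normal(0, 0.6, 500)  # measurement now
  wave2 = trait + rng.normal(0, 0.6, 500)  # same question after a suitable delay
  print(f"test-retest estimate: {np.corrcoef(wave1, wave2)[0, 1]:.2f}")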
So we actually more commonly use another way, which is distinct tests.
00:08:49.170 --> 00:08:56.700 The reason for having multiple survey questions that look the same, or look like they would measure
00:08:56.700 --> 00:09:03.660 the same thing, is that we actually think that they are distinct tests. So
00:09:03.660 --> 00:09:07.620 that's the most common reason for using multiple survey questions to measure the same thing.
00:09:07.620 --> 00:09:13.230 For example, we could ask a person to rate whether their company
00:09:13.230 --> 00:09:18.420 is innovative or not, whether they are the technological leader in the industry,
00:09:18.420 --> 00:09:22.560 and whether they are the first to bring new product concepts to markets.
00:09:22.560 --> 00:09:28.320 We could argue that these are distinct questions, so you don't answer
00:09:28.320 --> 00:09:32.280 the second question similarly to the first question, because these are really different
00:09:32.280 --> 00:09:37.110 questions. But they do measure the same trait. That's the argument we have to make.
00:09:37.110 --> 00:09:44.250 So the idea of distinct tests is that we generate tests that are not the same, so
00:09:44.250 --> 00:09:49.380 they're sufficiently different, but we can still argue that they all measure the same thing.
00:09:49.380 --> 00:09:55.980 How we use the data from these multiple distinct tests produces different ways of
00:09:55.980 --> 00:10:00.840 assessing reliability: the internal consistency method, the alternative forms method,
00:10:00.840 --> 00:10:06.510 and the split-half method. Understanding exactly what these all do is not important.
00:10:06.510 --> 00:10:10.920 It is important to understand the principle, then understand a couple of statistics
00:10:10.920 --> 00:10:14.790 that you can calculate from the data, and then understand their interpretations.
00:10:14.790 --> 00:10:23.430 The really important part here is that the tests really have to be distinct. If you're just
00:10:23.430 --> 00:10:28.260 asking the same question over and over with slightly different wording, for example "our
00:10:28.260 --> 00:10:32.700 firm is very innovative", "our company is very innovative", and "our business organization is
00:10:32.700 --> 00:10:38.220 very innovative", these are not distinct tests; it is just asking the same question over and over
00:10:38.220 --> 00:10:43.560 with slightly different wording. And this is something that you see very commonly as
00:10:43.560 --> 00:10:50.340 a reviewer: authors just write questions that are the same without paying much attention to
00:10:50.340 --> 00:10:58.020 the distinctiveness of these questions, and that's a big problem that I see in management research.
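NOTE An added sketch of the internal consistency method with made-up item data: coefficient alpha (Cronbach's alpha) computed from three distinct items assumed to measure the same trait, using alpha = k/(k-1) * (1 - sum of item variances / variance of the scale sum).
  import numpy as np
  def cronbach_alpha(items):
      # items: respondents x items matrix of scores
      k = items.shape[1]
      item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
      total_var = items.sum(axis=1).var(ddof=1)    # variance of the scale sum
      return k / (k - 1) * (1 - item_vars / total_var)
  rng = np.random.default_rng(3)
  innov = rng.normal(0, 1.0, 300)  # latent innovativeness of 300 hypothetical firms
  items = np.column_stack([innov + rng.normal(0, 0.7, 300) for _ in range(3)])
  print(f"coefficient alpha: {cronbach_alpha(items):.2f}")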