WEBVTT 00:00:00.920 --> 00:00:03.980 - Let's take a look at how you do computer assignments 00:00:03.980 --> 00:00:05.570 during the course. 00:00:05.570 --> 00:00:08.480 We have three computer assignments on the course, 00:00:08.480 --> 00:00:11.090 one is mandatory, and the second 00:00:11.090 --> 00:00:14.010 and third assignments are optional. 00:00:14.010 --> 00:00:16.720 These assignments are small analysis projects 00:00:16.720 --> 00:00:18.550 where you get a data file, 00:00:18.550 --> 00:00:22.010 or two data files in the final assignment, 00:00:22.010 --> 00:00:25.560 and you are supposed to answer a research question 00:00:25.560 --> 00:00:28.910 using those data files and the analysis techniques 00:00:28.910 --> 00:00:30.610 that we have taught on the course. 00:00:31.470 --> 00:00:32.580 For every submission, 00:00:32.580 --> 00:00:34.900 you need to submit two different files. 00:00:34.900 --> 00:00:37.140 So the analysis assignment submission 00:00:37.140 --> 00:00:40.090 always consists of two files. 00:00:40.090 --> 00:00:45.090 One is an analysis file, an R, SPSS or Stata file, 00:00:45.170 --> 00:00:47.050 and the idea of an analysis file 00:00:47.050 --> 00:00:49.510 is that it's a sequence of commands 00:00:49.510 --> 00:00:51.660 that the statistical software can run 00:00:51.660 --> 00:00:55.990 and it will reproduce whatever results you got 00:00:55.990 --> 00:00:58.610 if the person who runs the software 00:00:58.610 --> 00:01:00.660 has the same data set that you got. 00:01:00.660 --> 00:01:03.520 And then you need to write a report of 00:01:03.520 --> 00:01:05.210 the results in a Word file 00:01:05.210 --> 00:01:08.010 where you add your own interpretation of the results 00:01:08.010 --> 00:01:10.310 and an explanation of the analysis workflow.
00:01:10.310 --> 00:01:12.350 I have a more detailed explanation 00:01:12.350 --> 00:01:14.520 of how you actually do the Word file 00:01:14.520 --> 00:01:17.510 and how you do the analysis file on the course website 00:01:17.510 --> 00:01:19.810 for each of these three statistical software packages. 00:01:19.810 --> 00:01:23.670 But at this point it's worth taking a moment 00:01:23.670 --> 00:01:27.660 to explain why you need to submit two files. 00:01:27.660 --> 00:01:29.910 Why not just the report? 00:01:29.910 --> 00:01:33.200 And this is important for reproducibility. 00:01:33.200 --> 00:01:36.680 Quite often, how people apply statistical software 00:01:36.680 --> 00:01:39.310 is that they point and click around, 00:01:39.310 --> 00:01:40.860 they get a nice regression table, 00:01:40.860 --> 00:01:43.073 they save the table, then they point and click around, 00:01:43.073 --> 00:01:45.560 they get a nice correlation table, 00:01:45.560 --> 00:01:48.860 they save the table, then they point and click around 00:01:48.860 --> 00:01:50.900 and they get the third regression table, 00:01:50.900 --> 00:01:52.880 they save the table into a Word file 00:01:52.880 --> 00:01:54.380 and then they write the paper. 00:01:55.640 --> 00:02:00.230 Good, then someone comes and takes your correlation table 00:02:00.230 --> 00:02:02.770 and tells you that your correlations 00:02:02.770 --> 00:02:07.030 will not produce the regression results that you provided. 00:02:07.030 --> 00:02:08.560 What do you do about it? 00:02:08.560 --> 00:02:10.560 Then the question is, 00:02:10.560 --> 00:02:13.600 how exactly did you come up with that correlation table 00:02:13.600 --> 00:02:16.580 and that regression table if it's impossible 00:02:16.580 --> 00:02:18.670 to come up with the results 00:02:18.670 --> 00:02:22.600 that you just presented from one data set?
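A reader can check a correlation table against a regression table because, for standardized variables, the least-squares slopes depend only on the correlations. The following is a minimal Python sketch of that check for two predictors. It is not the course's R, SPSS or Stata workflow, and the data values and function names are invented for illustration.

```python
import math
import statistics


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cross = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cross / math.sqrt(ss_a * ss_b)


def standardized_slopes(r_y1, r_y2, r_12):
    """Standardized OLS slopes of y on x1 and x2, from correlations only."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom


def zscores(values):
    """Center and scale a list to mean 0 and sample standard deviation 1."""
    mean, sd = statistics.fmean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical data, invented purely for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.5, 2.0, 3.9, 4.1, 6.2, 5.8]

b1, b2 = standardized_slopes(pearson(y, x1), pearson(y, x2), pearson(x1, x2))

# If b1 and b2 really are the OLS slopes, the residuals of the standardized
# regression are orthogonal to both predictors.
z1, z2, zy = zscores(x1), zscores(x2), zscores(y)
residuals = [c - b1 * a - b2 * b for a, b, c in zip(z1, z2, zy)]
print("standardized slopes:", round(b1, 3), round(b2, 3))
print("residual . x1:", round(sum(e * a for e, a in zip(residuals, z1)), 10))
print("residual . x2:", round(sum(e * b for e, b in zip(residuals, z2)), 10))
```

If the slopes implied by a published correlation table disagree with the published standardized regression coefficients, the two tables cannot both come from the same data set and model, which is the situation described above.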
00:02:22.600 --> 00:02:25.500 If you go with this point and click, save a result, 00:02:25.500 --> 00:02:26.650 or point and click, 00:02:26.650 --> 00:02:28.670 save intermediate data sets, 00:02:28.670 --> 00:02:31.650 point and click, save a result approach, 00:02:31.650 --> 00:02:34.220 unless you are very good at keeping a diary 00:02:34.220 --> 00:02:36.640 of what you did, it is impossible 00:02:36.640 --> 00:02:39.470 to go back, say, two weeks 00:02:39.470 --> 00:02:42.300 and say what you did in the analysis. 00:02:42.300 --> 00:02:45.410 Analysis files solve this problem. 00:02:45.410 --> 00:02:47.390 If you have an analysis file, 00:02:47.390 --> 00:02:50.340 then you have complete documentation 00:02:50.340 --> 00:02:53.390 of what goes into your analysis. 00:02:53.390 --> 00:02:55.480 And it also helps you 00:02:55.480 --> 00:02:57.810 because if you are writing a journal paper 00:02:59.381 --> 00:03:02.480 there might be some changes that the reviewers 00:03:02.480 --> 00:03:04.410 want you to implement 00:03:04.410 --> 00:03:06.400 that could affect every table 00:03:06.400 --> 00:03:08.123 in your document. 00:03:09.100 --> 00:03:10.360 If you have an analysis file 00:03:10.360 --> 00:03:12.190 you simply implement the change 00:03:12.190 --> 00:03:14.110 and rerun the analysis file. 00:03:14.110 --> 00:03:17.000 If you did your tables with point and click, 00:03:17.000 --> 00:03:18.600 good luck in trying to remember 00:03:18.600 --> 00:03:20.330 how exactly you did them. 00:03:20.330 --> 00:03:22.480 I have personal experience of this 00:03:22.480 --> 00:03:24.280 and this is also something 00:03:24.280 --> 00:03:27.890 that you can find relevant in published research. 00:03:27.890 --> 00:03:30.040 I'll tell you about my personal experience, 00:03:30.040 --> 00:03:32.920 when I really was mad at myself 00:03:32.920 --> 00:03:35.170 for not following the good research practice 00:03:35.170 --> 00:03:36.910 of doing analysis files.
00:03:36.910 --> 00:03:39.820 I had this structural (indistinct) modeling project, 00:03:39.820 --> 00:03:43.100 I had been thinking about it for a couple of days 00:03:43.100 --> 00:03:46.600 trying to get my model to work, it didn't work. 00:03:46.600 --> 00:03:49.470 And then finally at the end of the third day 00:03:49.470 --> 00:03:50.800 I got it to work. 00:03:50.800 --> 00:03:54.520 I was happy, I went home, I opened a beer at home, 00:03:54.520 --> 00:03:57.410 slept overnight, came to the office the next morning 00:03:57.410 --> 00:04:01.210 to write about the nice result that I got, 00:04:01.210 --> 00:04:03.683 and I did not remember how I did that. 00:04:04.620 --> 00:04:07.120 So I had a result, but I had no way 00:04:07.120 --> 00:04:08.460 of explaining how I did that 00:04:08.460 --> 00:04:10.720 because I had not documented the workflow. 00:04:10.720 --> 00:04:15.720 Then it took me two weeks to replicate the steps 00:04:15.750 --> 00:04:18.900 that got me to the nice analysis result. 00:04:18.900 --> 00:04:20.960 And at that point, I realized 00:04:20.960 --> 00:04:23.550 that the nice analysis result was a mistake. 00:04:23.550 --> 00:04:27.840 And I had thus wasted three weeks of my time doing that. 00:04:28.750 --> 00:04:32.900 Another thing is that when you see papers, 00:04:32.900 --> 00:04:34.580 they have inconsistencies. 00:04:34.580 --> 00:04:37.890 They have results that are impossible. 00:04:37.890 --> 00:04:42.070 You have a complicated model that fits 00:04:42.070 --> 00:04:44.470 the data worse than a simple model. 00:04:44.470 --> 00:04:47.020 That is a mathematical impossibility. 00:04:47.020 --> 00:04:49.640 And that just indicates that there's something funny 00:04:49.640 --> 00:04:52.480 going on in an analysis that is published.
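The impossibility mentioned here can be illustrated for one specific case, assumed for this sketch: two nested models estimated by least squares on the same data, compared by in-sample residual sum of squares. The simpler model is a special case of the more complex one, so the complex model can always match its fit. Fit indices that penalize complexity can still get worse. The Python below is an illustrative sketch with simulated data and hypothetical function names, not the course's software.

```python
import random


def rss_mean_only(y):
    """Residual sum of squares of the simple model: y predicted by its mean."""
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)


def rss_one_predictor(x, y):
    """Residual sum of squares of the larger model: y = intercept + slope * x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((v - intercept - slope * u) ** 2 for u, v in zip(x, y))


# Simulated data in which x is pure noise and unrelated to y. Even then,
# adding x cannot make the in-sample fit worse, because setting the slope
# to zero reproduces the mean-only model exactly.
rng = random.Random(1)
for trial in range(5):
    x = [rng.gauss(0, 1) for _ in range(30)]
    y = [rng.gauss(0, 1) for _ in range(30)]
    simple_fit = rss_mean_only(y)
    larger_fit = rss_one_predictor(x, y)
    print(trial, round(simple_fit, 3), round(larger_fit, 3), larger_fit <= simple_fit + 1e-9)
```

A published table in which the larger nested model has the worse in-sample fit therefore points to an error somewhere in the workflow, such as different samples or a mislabeled model.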
00:04:52.480 --> 00:04:56.730 But unless you have a step-by-step document 00:04:56.730 --> 00:04:58.990 of what the researchers did, 00:04:58.990 --> 00:05:03.280 there is no way of knowing that afterwards. 00:05:03.280 --> 00:05:07.210 I know this because sometimes when I see a result 00:05:07.210 --> 00:05:09.460 that can't be true, I email the authors. 00:05:09.460 --> 00:05:11.790 Sometimes I get an analysis file back 00:05:11.790 --> 00:05:13.920 showing that this is what we did, 00:05:13.920 --> 00:05:16.510 and yeah, oftentimes it makes sense. 00:05:16.510 --> 00:05:20.070 Other times I get an answer saying 00:05:20.070 --> 00:05:23.940 that the authors don't remember what they did. 00:05:23.940 --> 00:05:26.990 And it's certainly not ideal that we have results 00:05:26.990 --> 00:05:30.010 that not a single person in the world knows 00:05:30.010 --> 00:05:34.010 how they were calculated, published in a good journal. 00:05:34.010 --> 00:05:36.320 So analysis files are important. 00:05:36.320 --> 00:05:40.710 And some journals are starting to require analysis files 00:05:40.710 --> 00:05:42.690 to be made available for reviewers. 00:05:42.690 --> 00:05:45.740 So unless you have an analysis file, 00:05:45.740 --> 00:05:48.717 then some journals are out of the question for you. 00:05:49.800 --> 00:05:51.360 So that's the importance of analysis files. 00:05:51.360 --> 00:05:53.070 And it's really, really important 00:05:53.070 --> 00:05:55.950 that we start doing this early on, even for basic tasks. 00:05:55.950 --> 00:05:58.630 And this is why we do analysis files. 00:05:58.630 --> 00:06:01.400 Okay, so that's all about analysis files. 00:06:01.400 --> 00:06:05.410 Then I give you guidance in two different ways. 00:06:05.410 --> 00:06:07.490 There is the computer class.
00:06:07.490 --> 00:06:11.990 It's going to be a normal class where we are in person 00:06:11.990 --> 00:06:13.300 or it's going to be on Zoom 00:06:13.300 --> 00:06:15.990 depending on whether we're in online or in-person format 00:06:15.990 --> 00:06:17.380 on the course. 00:06:17.380 --> 00:06:21.250 And then I have screencasts explaining the analysis. 00:06:21.250 --> 00:06:23.750 The first screencasts are very basic. 00:06:23.750 --> 00:06:25.590 They hold your hand and show you 00:06:25.590 --> 00:06:27.490 how you implement the regression analysis 00:06:27.490 --> 00:06:29.840 using these different software packages. 00:06:29.840 --> 00:06:32.840 In the more advanced assignments, 00:06:32.840 --> 00:06:35.600 I give you the tools and then let you 00:06:35.600 --> 00:06:39.450 put the answer together using those tools, 00:06:39.450 --> 00:06:42.120 so there's more of your own thinking required. 00:06:42.120 --> 00:06:45.600 Of course you can always do things differently 00:06:45.600 --> 00:06:48.040 from what I instructed, but then you need 00:06:48.040 --> 00:06:49.990 to justify your approach. 00:06:49.990 --> 00:06:52.410 If you are interested in testing different models, 00:06:52.410 --> 00:06:55.143 sure, go ahead, I'll comment on them. 00:06:56.260 --> 00:07:01.060 So this is what the data analysis answer looks like. 00:07:01.060 --> 00:07:05.320 So you have the assignment here, screencasts here, 00:07:05.320 --> 00:07:07.610 and then for this first assignment 00:07:07.610 --> 00:07:10.900 there is also a PDF description of the data set. 00:07:10.900 --> 00:07:15.370 And when you submit to Turnitin, there are two tabs. 00:07:15.370 --> 00:07:20.040 There's the report tab, which your Word file should go into, 00:07:20.040 --> 00:07:22.990 and then there's the analysis file tab, 00:07:22.990 --> 00:07:24.630 into which your analysis file, 00:07:24.630 --> 00:07:28.280 the R, Stata or SPSS file, should go.
00:07:28.280 --> 00:07:31.390 I only comment on and grade the report, 00:07:31.390 --> 00:07:34.910 but I want the analysis file to be available for checking 00:07:34.910 --> 00:07:38.510 if there is something weird going on in your report, 00:07:38.510 --> 00:07:41.480 for example, if the number of observations 00:07:41.480 --> 00:07:45.260 shifts between two regressions in an inexplicable manner. 00:07:45.260 --> 00:07:48.640 Then I would want to see how the analysis was done. 00:07:48.640 --> 00:07:51.740 Sometimes I actually run the analysis 00:07:51.740 --> 00:07:56.300 that the students submit, and if you submit an analysis file 00:07:56.300 --> 00:07:59.530 that does not run, but submit a report, 00:07:59.530 --> 00:08:01.970 I generally tend to fail that report 00:08:01.970 --> 00:08:06.150 because it's not possible that you would have produced 00:08:06.150 --> 00:08:09.843 that report from the analysis file that you submitted.
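One ordinary reason for the number of observations to differ between two regressions is that rows with missing values are dropped separately for each model. The lecture does not say this is the cause in any particular report; the Python sketch below only illustrates how an analysis file makes such a shift traceable. The variable names, data and helper function are hypothetical.

```python
def complete_cases(rows, variables):
    """Return the rows that have a non-missing value for every listed variable."""
    return [row for row in rows if all(row.get(name) is not None for name in variables)]


# Hypothetical data set, invented for illustration; None marks a missing value.
rows = [
    {"sales": 10.0, "size": 5.0, "age": 3.0},
    {"sales": 12.0, "size": 6.0, "age": None},
    {"sales": None, "size": 4.0, "age": 7.0},
    {"sales": 9.0, "size": 3.0, "age": 2.0},
    {"sales": 14.0, "size": 8.0, "age": None},
]

# Model 1 uses sales and size; model 2 adds age, which has more missing values.
n_model_1 = len(complete_cases(rows, ["sales", "size"]))
n_model_2 = len(complete_cases(rows, ["sales", "size", "age"]))
print("observations in model 1:", n_model_1)
print("observations in model 2:", n_model_2)
```

With the commands written down like this, a grader or reviewer can see which variables entered each model and why the sample size changed, instead of having to guess from the report alone.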