WEBVTT

00:00:00.920 --> 00:00:03.980
- Let's take a look at how
you do computer assignments

00:00:03.980 --> 00:00:05.570
during the course.

00:00:05.570 --> 00:00:08.480
We have three computer
assignments on the course,

00:00:08.480 --> 00:00:11.090
one is mandatory, and the second

00:00:11.090 --> 00:00:14.010
and third assignments are optional.

00:00:14.010 --> 00:00:16.720
These assignments are
small analysis projects

00:00:16.720 --> 00:00:18.550
where you get a data file,

00:00:18.550 --> 00:00:22.010
or two data files in the final assignment,

00:00:22.010 --> 00:00:25.560
and you are supposed to
answer a research question

00:00:25.560 --> 00:00:28.910
using those data files and analysis

00:00:28.910 --> 00:00:30.610
that we have taught on the course.

00:00:31.470 --> 00:00:32.580
For every submission,

00:00:32.580 --> 00:00:34.900
you need to submit two different files.

00:00:34.900 --> 00:00:37.140
So the analysis assignment submission

00:00:37.140 --> 00:00:40.090
always consists of two files.

00:00:40.090 --> 00:00:45.090
One is an analysis file,
an R, SPSS or Stata file,

00:00:45.170 --> 00:00:47.050
and the idea of an analysis file

00:00:47.050 --> 00:00:49.510
is that it's a sequence of commands

00:00:49.510 --> 00:00:51.660
that the statistical software can run

00:00:51.660 --> 00:00:55.990
and it will reproduce
whatever results you got

00:00:55.990 --> 00:00:58.610
if the person who runs the software

00:00:58.610 --> 00:01:00.660
has the same data set that you got.

00:01:00.660 --> 00:01:03.520
And then you need to write a report of

00:01:03.520 --> 00:01:05.210
the results in a Word file

00:01:05.210 --> 00:01:08.010
where you add your own
interpretation of the results

00:01:08.010 --> 00:01:10.310
and an explanation of the analysis workflow.

00:01:10.310 --> 00:01:12.350
I have a more detailed explanation

00:01:12.350 --> 00:01:14.520
on how you actually do the Word file

00:01:14.520 --> 00:01:17.510
and how you do the analysis
file on the course website

00:01:17.510 --> 00:01:19.810
for each of these three
statistical software.

00:01:19.810 --> 00:01:23.670
But at this point it's worth taking a moment

00:01:23.670 --> 00:01:27.660
to explain why you need
to submit two files.

00:01:27.660 --> 00:01:29.910
Why not just the report?

00:01:29.910 --> 00:01:33.200
And this is important for reproducibility.

00:01:33.200 --> 00:01:36.680
Quite often, how people
apply statistical software

00:01:36.680 --> 00:01:39.310
is that they point and click around,

00:01:39.310 --> 00:01:40.860
they get a nice regression table,

00:01:40.860 --> 00:01:43.073
they save the table, then
point and click around,

00:01:43.073 --> 00:01:45.560
they get a nice correlation table,

00:01:45.560 --> 00:01:48.860
they save the table, then
they point and click around

00:01:48.860 --> 00:01:50.900
and they get the third regression table,

00:01:50.900 --> 00:01:52.880
they save the table into a Word file

00:01:52.880 --> 00:01:54.380
and then they write the paper.

00:01:55.640 --> 00:02:00.230
Good, then someone comes and
takes their correlation table

00:02:00.230 --> 00:02:02.770
and tells you that your correlations

00:02:02.770 --> 00:02:07.030
will not produce the regression
results that you provided.

00:02:07.030 --> 00:02:08.560
What do you do about it?

00:02:08.560 --> 00:02:10.560
Then the question is,

00:02:10.560 --> 00:02:13.600
how exactly did you come up
with that correlation table

00:02:13.600 --> 00:02:16.580
and that regression
table if it's impossible

00:02:16.580 --> 00:02:18.670
to come up with the results

00:02:18.670 --> 00:02:22.600
that you just presented from one data set?

00:02:22.600 --> 00:02:25.500
If you go with this point
and click save a result

00:02:25.500 --> 00:02:26.650
or point and click

00:02:26.650 --> 00:02:28.670
save intermediate data sets,

00:02:28.670 --> 00:02:31.650
point and click, save a result approach

00:02:31.650 --> 00:02:34.220
unless you are very
good at keeping a diary

00:02:34.220 --> 00:02:36.640
of what you did, it is impossible

00:02:36.640 --> 00:02:39.470
to go back say two weeks ago

00:02:39.470 --> 00:02:42.300
and say what you did in the analysis.

00:02:42.300 --> 00:02:45.410
Analysis files solve this problem.

00:02:45.410 --> 00:02:47.390
If you have an analysis file,

00:02:47.390 --> 00:02:50.340
then you have complete documentation

00:02:50.340 --> 00:02:53.390
on what goes into your analysis.

00:02:53.390 --> 00:02:55.480
And it also helps you

00:02:55.480 --> 00:02:57.810
because if you are writing a journal paper

00:02:59.381 --> 00:03:02.480
there might be some
changes that the reviewers

00:03:02.480 --> 00:03:04.410
want you to implement

00:03:04.410 --> 00:03:06.400
that could affect every table

00:03:06.400 --> 00:03:08.123
in your document.

00:03:09.100 --> 00:03:10.360
If you have an analysis file

00:03:10.360 --> 00:03:12.190
you simply implement the change,

00:03:12.190 --> 00:03:14.110
rerun the analysis file.

00:03:14.110 --> 00:03:17.000
If you did your tables
with point and click

00:03:17.000 --> 00:03:18.600
good luck in trying to remember

00:03:18.600 --> 00:03:20.330
how exactly you did it.

00:03:20.330 --> 00:03:22.480
I have personal experience on this

00:03:22.480 --> 00:03:24.280
and this is also something

00:03:24.280 --> 00:03:27.890
that you can find relevant
in published research.

00:03:27.890 --> 00:03:30.040
I'll tell you about my personal experience,

00:03:30.040 --> 00:03:32.920
when I really was mad at myself

00:03:32.920 --> 00:03:35.170
for not following a good research practice

00:03:35.170 --> 00:03:36.910
of doing analysis files.

00:03:36.910 --> 00:03:39.820
I had this structural
equation modeling project,

00:03:39.820 --> 00:03:43.100
I had been thinking about
it for a couple of days

00:03:43.100 --> 00:03:46.600
trying to get my model
to work, it didn't work.

00:03:46.600 --> 00:03:49.470
And then finally at the
end of the third day

00:03:49.470 --> 00:03:50.800
I got it to work.

00:03:50.800 --> 00:03:54.520
I was happy, I went home,
I opened a beer at home,

00:03:54.520 --> 00:03:57.410
slept overnight, came to
the office the next morning

00:03:57.410 --> 00:04:01.210
to write about the nice result that I got,

00:04:01.210 --> 00:04:03.683
and I did not remember how I did that.

00:04:04.620 --> 00:04:07.120
So I had a result, but I had no way

00:04:07.120 --> 00:04:08.460
of explaining how I did that

00:04:08.460 --> 00:04:10.720
because I had not documented the workflow.

00:04:10.720 --> 00:04:15.720
Then it took me two weeks
to replicate the steps

00:04:15.750 --> 00:04:18.900
that got me to the nice analysis result.

00:04:18.900 --> 00:04:20.960
And at that point, I realized

00:04:20.960 --> 00:04:23.550
that the nice analysis
result was a mistake.

00:04:23.550 --> 00:04:27.840
And I had thus wasted three
weeks of my time doing that.

00:04:28.750 --> 00:04:32.900
Another thing is that when you see papers,

00:04:32.900 --> 00:04:34.580
they have inconsistencies.

00:04:34.580 --> 00:04:37.890
They have results that are impossible.

00:04:37.890 --> 00:04:42.070
You have a complicated model that fits

00:04:42.070 --> 00:04:44.470
to the data worse than a simple model.

00:04:44.470 --> 00:04:47.020
That is a mathematical impossibility.

00:04:47.020 --> 00:04:49.640
And that just indicates
that there's something funny

00:04:49.640 --> 00:04:52.480
going on in an analysis that is published.

00:04:52.480 --> 00:04:56.730
But unless you have a
step by step document

00:04:56.730 --> 00:04:58.990
of what the researchers did,

00:04:58.990 --> 00:05:03.280
there is no way of
knowing that afterwards.

00:05:03.280 --> 00:05:07.210
I know this because
sometimes when I see a result

00:05:07.210 --> 00:05:09.460
that can't be true, I email the authors.

00:05:09.460 --> 00:05:11.790
Sometimes I get an analysis file back

00:05:11.790 --> 00:05:13.920
showing that this is what we did,

00:05:13.920 --> 00:05:16.510
and yeah, often times it makes sense.

00:05:16.510 --> 00:05:20.070
Other times I get an answer saying

00:05:20.070 --> 00:05:23.940
that the authors don't
remember what they did.

00:05:23.940 --> 00:05:26.990
And it's certainly not the
ideal that we have results

00:05:26.990 --> 00:05:30.010
that not a single person
in the world knows

00:05:30.010 --> 00:05:34.010
how they were calculated,
published in a good journal.

00:05:34.010 --> 00:05:36.320
So analysis files are important.

00:05:36.320 --> 00:05:40.710
And some journals are starting
to require analysis files

00:05:40.710 --> 00:05:42.690
to be made available for reviewers.

00:05:42.690 --> 00:05:45.740
So unless you have an analysis file

00:05:45.740 --> 00:05:48.717
then some journals are out
of the question for you.

00:05:49.800 --> 00:05:51.360
So that's the importance
of analysis files.

00:05:51.360 --> 00:05:53.070
And it's really, really important

00:05:53.070 --> 00:05:55.950
that we start doing this early
on, even for basic tasks.

00:05:55.950 --> 00:05:58.630
And this is why we do analysis files.

00:05:58.630 --> 00:06:01.400
Okay, so that's all about analysis files.

00:06:01.400 --> 00:06:05.410
Then I give you guidance
in two different ways.

00:06:05.410 --> 00:06:07.490
There are two computer classes.

00:06:07.490 --> 00:06:11.990
It's going to be a normal
class where we are in person

00:06:11.990 --> 00:06:13.300
or it's going to be on Zoom

00:06:13.300 --> 00:06:15.990
depending on if we're in
online or in-person format

00:06:15.990 --> 00:06:17.380
on the course.

00:06:17.380 --> 00:06:21.250
And then I have screencasts
explaining the analysis.

00:06:21.250 --> 00:06:23.750
The first screencasts are very basic.

00:06:23.750 --> 00:06:25.590
They hold your hand and show you

00:06:25.590 --> 00:06:27.490
how you implement the regression analysis

00:06:27.490 --> 00:06:29.840
using these different software.

00:06:29.840 --> 00:06:32.840
In the more advanced assignments,

00:06:32.840 --> 00:06:35.600
I give you the tools and then let you

00:06:35.600 --> 00:06:39.450
put the answer together
using those tools,

00:06:39.450 --> 00:06:42.120
so there's more of your
own thinking required.

00:06:42.120 --> 00:06:45.600
Of course you can always
do things differently

00:06:45.600 --> 00:06:48.040
from what I instructed but then you need

00:06:48.040 --> 00:06:49.990
to justify your approach.

00:06:49.990 --> 00:06:52.410
If you are interested in
testing different models,

00:06:52.410 --> 00:06:55.143
sure, go ahead, I'll comment on them.

00:06:56.260 --> 00:07:01.060
So this is what the data
analysis assignment looks like.

00:07:01.060 --> 00:07:05.320
So you have the assignment
here, screencasts here,

00:07:05.320 --> 00:07:07.610
and then for this first assignment

00:07:07.610 --> 00:07:10.900
there is also a PDF
description of the data set.

00:07:10.900 --> 00:07:15.370
And when you submit to
Turnitin, there are two tabs.

00:07:15.370 --> 00:07:20.040
There's the report tab which
your Word file should go into

00:07:20.040 --> 00:07:22.990
and then there's the analysis file tab

00:07:22.990 --> 00:07:24.630
to which your analysis file,

00:07:24.630 --> 00:07:28.280
the R, Stata or SPSS file, should go.

00:07:28.280 --> 00:07:31.390
I only comment and grade the report

00:07:31.390 --> 00:07:34.910
but I want the analysis file
to be available for checking

00:07:34.910 --> 00:07:38.510
if there is something weird
going on in your report

00:07:38.510 --> 00:07:41.480
for example, if the
number of observations

00:07:41.480 --> 00:07:45.260
shifts between two regressions
in an inexplicable manner.

00:07:45.260 --> 00:07:48.640
Then I would want to see
how the analysis was done.

00:07:48.640 --> 00:07:51.740
Sometimes I actually run the analysis

00:07:51.740 --> 00:07:56.300
that the students submit, and
if you submit an analysis file

00:07:56.300 --> 00:07:59.530
that does not run, but submit the report,

00:07:59.530 --> 00:08:01.970
I generally tend to fail that report

00:08:01.970 --> 00:08:06.150
because it's not possible
that you would have produced

00:08:06.150 --> 00:08:09.843
that report from the analysis
file that you submitted.