WEBVTT

00:00:01.300 --> 00:00:02.330
- During the course,

00:00:02.330 --> 00:00:03.370
you will need to complete

00:00:03.370 --> 00:00:06.150
at least one data analysis assignment.

00:00:06.150 --> 00:00:09.260
And I thought it's a good
idea to start the course

00:00:09.260 --> 00:00:11.130
by discussing a little bit

00:00:11.130 --> 00:00:13.440
about the different software
choices that you can make.

00:00:13.440 --> 00:00:16.230
So you must choose at
least one of these software

00:00:16.230 --> 00:00:17.990
to use for the first assignment,

00:00:17.990 --> 00:00:21.400
you can of course use multiple software

00:00:21.400 --> 00:00:23.510
for different assignments if you want to,

00:00:23.510 --> 00:00:25.870
and I have some people who
have come to the course

00:00:25.870 --> 00:00:29.350
later on to complete it
with a different software.

00:00:29.350 --> 00:00:31.730
We are using three different
software on the course.

00:00:31.730 --> 00:00:35.727
We have SPSS, Stata, and R.

00:00:35.727 --> 00:00:38.330
And these are fairly different.

00:00:38.330 --> 00:00:41.790
Still, you can complete the course fully

00:00:41.790 --> 00:00:43.510
using any of these software.

00:00:43.510 --> 00:00:46.420
I have very strong opinions
on which of this software

00:00:46.420 --> 00:00:47.320
you should apply

00:00:47.320 --> 00:00:50.590
if you want to be a
professional researcher.

00:00:50.590 --> 00:00:51.960
But let's first take a look at

00:00:51.960 --> 00:00:53.110
what statistical software is

00:00:53.110 --> 00:00:55.750
and how it differs from Excel.

00:00:55.750 --> 00:00:56.760
So in Excel,

00:00:56.760 --> 00:00:59.740
your data and your analysis
live in one worksheet.

00:00:59.740 --> 00:01:03.250
So some of the cells have data,
and if you do calculations,

00:01:03.250 --> 00:01:05.810
then those calculations
go to different cells

00:01:05.810 --> 00:01:09.120
or maybe a different
sheet within the same file

00:01:09.120 --> 00:01:11.620
and then again inside cells in there.

00:01:11.620 --> 00:01:15.830
And also all calculation results
appear in the same cells.

00:01:15.830 --> 00:01:20.610
So you have data, analysis specification,

00:01:20.610 --> 00:01:22.930
and results in the same file.

00:01:22.930 --> 00:01:25.570
And it's not very easy for anyone

00:01:25.570 --> 00:01:28.460
who has not made the sheet themselves

00:01:28.460 --> 00:01:32.400
to understand what the
logical sequence is

00:01:32.400 --> 00:01:34.090
behind the analysis.

00:01:34.090 --> 00:01:35.630
So if you calculate the mean

00:01:35.630 --> 00:01:37.780
and then you calculate
the standard deviation,

00:01:37.780 --> 00:01:40.720
it's not clear by looking
at the Excel sheet,

00:01:40.720 --> 00:01:42.710
which one is calculated first.

00:01:42.710 --> 00:01:45.623
In some cases it doesn't
matter, in some cases it does.

00:01:46.860 --> 00:01:49.730
Statistical software is
a different kind of tool.

00:01:49.730 --> 00:01:52.970
So statistical software has data,

00:01:52.970 --> 00:01:56.060
it has analysis specification
and it has results.

00:01:56.060 --> 00:01:58.720
But these are typically
in three separate files.

00:01:58.720 --> 00:02:02.330
And the data file is
something that you hardly,

00:02:02.330 --> 00:02:03.240
you never edit.

00:02:03.240 --> 00:02:05.537
So your data file is what you have,

00:02:05.537 --> 00:02:08.160
and you never edit it.

00:02:08.160 --> 00:02:12.150
Then the analysis file lists
the sequence of operations

00:02:12.150 --> 00:02:15.540
or commands or analysis that
you applied to the data.

00:02:15.540 --> 00:02:19.110
And it's basically a
text file or a document,

00:02:19.110 --> 00:02:21.020
and you read it from top to bottom

00:02:21.020 --> 00:02:24.193
and then the computer applies things

00:02:25.230 --> 00:02:28.220
to the data using that sequence.

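NOTE
A minimal sketch of the analysis-file idea described above, written in Python
rather than SPSS, Stata, or R purely for illustration. The inline data, the
variable name income, and the function names are hypothetical. The data are
only read, never edited, and the steps run in a fixed order from top to bottom.
```python
import csv
import io
import statistics
# Hypothetical raw data; in practice this would be a separate file that is only read.
RAW_DATA = "id,income\n1,2000\n2,3000\n3,4000\n"
def load_incomes(text):
    """Step 1: load the data set without modifying it."""
    return [float(row["income"]) for row in csv.DictReader(io.StringIO(text))]
def run_analysis(text):
    """Run every step in a fixed order and return the results separately from the data."""
    incomes = load_incomes(text)       # step 1: load
    mean = statistics.mean(incomes)    # step 2: mean
    sd = statistics.stdev(incomes)     # step 3: sample standard deviation
    return {"n": len(incomes), "mean": mean, "sd": sd}
if __name__ == "__main__":
    print(run_analysis(RAW_DATA))
```
Rerunning the script reproduces the same results from the untouched data,
which is the point of keeping an analysis file.
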
00:02:28.220 --> 00:02:30.830
So data analysis using statistical software

00:02:30.830 --> 00:02:32.490
is command driven.

00:02:32.490 --> 00:02:35.600
And commands can do analysis,
they can manipulate data,

00:02:35.600 --> 00:02:36.990
they can load data sets, save data sets,

00:02:36.990 --> 00:02:39.170
so they do all kinds of things.

00:02:39.170 --> 00:02:43.230
All of these programs are command driven.

00:02:43.230 --> 00:02:46.230
R is a bit different because it's not

00:02:46.230 --> 00:02:50.670
as much a statistical analysis
software as Stata and SPSS.

00:02:50.670 --> 00:02:53.790
Instead it's a statistical
programming environment.

00:02:53.790 --> 00:02:58.790
So it's much more focused on
programming than Stata and SPSS

00:02:59.200 --> 00:03:00.610
which are more focused

00:03:00.610 --> 00:03:03.680
on just sequentially applying
commands to the data.

00:03:03.680 --> 00:03:06.150
Of course, you can do that as well with R,

00:03:06.150 --> 00:03:07.950
but R is a much more general system.

00:03:08.860 --> 00:03:11.010
These have also different
target audiences.

00:03:11.890 --> 00:03:14.520
SPSS is owned by IBM and they are,

00:03:14.520 --> 00:03:16.800
one of their main markets is corporations.

00:03:16.800 --> 00:03:18.860
So they want to target
marketing departments

00:03:18.860 --> 00:03:20.900
and they have analysis techniques

00:03:20.900 --> 00:03:22.440
that are relevant for marketing,

00:03:22.440 --> 00:03:25.510
like customer segmentation
analysis and things like that

00:03:25.510 --> 00:03:27.920
that are not relevant for
social science research.

00:03:27.920 --> 00:03:32.270
Then Stata has been
developed first by a person

00:03:32.270 --> 00:03:33.760
with a background in university

00:03:33.760 --> 00:03:36.630
and is focused on social sciences.

00:03:36.630 --> 00:03:38.320
So it's focused on social sciences

00:03:38.320 --> 00:03:40.400
and nowadays life
science research as well.

00:03:40.400 --> 00:03:41.950
But this is specifically designed

00:03:41.950 --> 00:03:43.800
for university researchers.

00:03:43.800 --> 00:03:45.900
And R is a programming environment

00:03:45.900 --> 00:03:47.730
so it's designed to be very general

00:03:47.730 --> 00:03:49.733
without any specific target audience.

00:03:50.620 --> 00:03:55.270
What this difference
means is that with R,

00:03:55.270 --> 00:03:56.810
you can do the most things,

00:03:56.810 --> 00:04:01.300
but R because it's general
instead of focused specifically

00:04:01.300 --> 00:04:02.730
on certain tasks,

00:04:02.730 --> 00:04:05.150
it may not be the easiest to use tool

00:04:05.150 --> 00:04:07.500
or the most efficient
tool for doing something.

00:04:07.500 --> 00:04:09.210
Then Stata has a more narrow scope

00:04:09.210 --> 00:04:13.580
and it's very good at
social science research.

00:04:13.580 --> 00:04:16.070
So most of the things that
a social science researcher wants

00:04:16.070 --> 00:04:17.740
to do Stata provides,

00:04:17.740 --> 00:04:21.730
and it's a fairly nice to
use tool for that purpose.

00:04:21.730 --> 00:04:23.850
SPSS because of its focus,

00:04:23.850 --> 00:04:27.030
then lacks some of the tools
that we apply on the course.

00:04:27.030 --> 00:04:29.440
So because it's not focused

00:04:29.440 --> 00:04:31.465
on the kind of research that we do,

00:04:31.465 --> 00:04:35.280
then you need to go through
some extra steps to get

00:04:35.280 --> 00:04:36.750
some basic results.

00:04:36.750 --> 00:04:39.377
So it may be good for market segmentation,

00:04:39.377 --> 00:04:41.530
but it's not as easy to use as Stata,

00:04:41.530 --> 00:04:43.270
once you know how to use it.

00:04:43.270 --> 00:04:44.900
The documentation is also quite different.

00:04:44.900 --> 00:04:49.430
So SPSS documentation is
about how to use SPSS,

00:04:49.430 --> 00:04:52.480
so that's the normal
software documentation.

00:04:52.480 --> 00:04:53.313
It's not,

00:04:53.313 --> 00:04:56.550
it doesn't try to get you to
understand regression analysis

00:04:56.550 --> 00:04:58.310
it just tells you that if
you understand regression,

00:04:58.310 --> 00:05:00.570
this is how you do it with SPSS.

00:05:00.570 --> 00:05:02.220
Stata on the other hand,

00:05:02.220 --> 00:05:04.430
their documentation explains
the analysis as well.

00:05:04.430 --> 00:05:08.040
So this is a pretty good
learning resource as well.

00:05:08.040 --> 00:05:11.440
So whereas SPSS manual
tells you how to use SPSS,

00:05:11.440 --> 00:05:16.198
Stata tells you how certain
analyses are used and why,

00:05:16.198 --> 00:05:18.740
and how you get things done with Stata.

00:05:18.740 --> 00:05:22.080
Then R documentation is not
good for learning at all.

00:05:22.080 --> 00:05:24.904
Typically R documentation
tells you how a certain command

00:05:24.904 --> 00:05:28.910
is specified and then it may
point to an original source

00:05:28.910 --> 00:05:31.110
by whoever first invented,

00:05:31.110 --> 00:05:33.670
let's say a regression
analysis and tell the reader,

00:05:33.670 --> 00:05:35.830
tell the user to look at the details

00:05:35.830 --> 00:05:38.080
of regression analysis
from the original source.

00:05:38.080 --> 00:05:41.000
So this is a less
user-friendly documentation.

00:05:41.000 --> 00:05:44.120
The availability of this
software differs as well.

00:05:44.120 --> 00:05:48.450
Most universities that
I've worked with have SPSS,

00:05:48.450 --> 00:05:50.700
have a site license for SPSS.

00:05:50.700 --> 00:05:52.520
Which means that SPSS is installed

00:05:52.520 --> 00:05:54.490
on all university computers,

00:05:54.490 --> 00:05:57.260
and typically the university
also provides a way

00:05:57.260 --> 00:06:00.920
for students and staff to install SPSS

00:06:00.920 --> 00:06:02.070
on their home computer.

00:06:02.950 --> 00:06:04.300
Stata on the other hand,

00:06:04.300 --> 00:06:06.010
doesn't have such site licensing agreements.

00:06:06.010 --> 00:06:09.590
So Stata usually is
installed in a computer lab,

00:06:09.590 --> 00:06:12.210
and if your university has
a purchasing agreement,

00:06:12.210 --> 00:06:14.920
then typically it's fairly easy to get it

00:06:14.920 --> 00:06:16.400
on your work computer,

00:06:16.400 --> 00:06:19.020
but not, probably not
for your home computer.

00:06:19.020 --> 00:06:21.020
R on the other hand is open-source

00:06:21.020 --> 00:06:24.000
and typically installed on
all university computers,

00:06:24.000 --> 00:06:24.833
and it's free.

00:06:24.833 --> 00:06:28.130
You can just download R
and the RStudio editor,

00:06:28.130 --> 00:06:31.367
which is highly recommended,
on your own computer,

00:06:31.367 --> 00:06:33.690
and then just use it
because it doesn't cost you,

00:06:33.690 --> 00:06:36.653
there's no cost attached.

00:06:38.040 --> 00:06:40.180
There are a couple of different ways

00:06:40.180 --> 00:06:41.580
how you use the software

00:06:41.580 --> 00:06:45.980
and the different ways of
use partially determine

00:06:45.980 --> 00:06:47.530
which software is best for you.

00:06:48.540 --> 00:06:52.210
The data sets and commands
are two separate things

00:06:52.210 --> 00:06:56.350
in a statistical analysis
software as I explained,

00:06:56.350 --> 00:06:57.440
so the data file,

00:06:57.440 --> 00:07:00.340
that's whatever you got from
your data collection efforts,

00:07:00.340 --> 00:07:04.090
you never edit that,
it's columns and rows,

00:07:04.090 --> 00:07:05.700
and if you can,

00:07:05.700 --> 00:07:08.750
some software allow you to
have multiple data sets open,

00:07:08.750 --> 00:07:10.380
some software don't,

00:07:10.380 --> 00:07:13.730
it could be viewed as an
advantage to have multiple files

00:07:13.730 --> 00:07:15.270
or multiple data files open

00:07:15.270 --> 00:07:17.840
but then the problem is that
when you execute a command,

00:07:17.840 --> 00:07:20.890
how do you know which data set
you're actually working on?

00:07:20.890 --> 00:07:22.610
So with SPSS,

00:07:22.610 --> 00:07:25.860
I have multiple students
who are really confused

00:07:25.860 --> 00:07:27.510
that they apply an analysis,

00:07:27.510 --> 00:07:29.500
and the analysis result is unexpected.

00:07:29.500 --> 00:07:31.520
The reason why it's unexpected

00:07:31.520 --> 00:07:33.680
is that they have two data files open.

00:07:33.680 --> 00:07:36.230
They thought that they were
analyzing the first data file,

00:07:36.230 --> 00:07:38.680
but SPSS was actually
analyzing the second data file.

00:07:40.530 --> 00:07:41.720
Then we have command files,

00:07:41.720 --> 00:07:45.210
which are a sequence of data
manipulation and analysis commands,

00:07:45.210 --> 00:07:47.550
and these store the
logic of your analysis.

00:07:47.550 --> 00:07:50.753
However you want to use
your statistical software

00:07:50.753 --> 00:07:55.460
you should always at the
end have an analysis file.

00:07:55.460 --> 00:07:57.530
If you have a graphical user interface,

00:07:57.530 --> 00:07:59.730
which Stata and SPSS have,

00:07:59.730 --> 00:08:01.790
then those software, when you use them,

00:08:01.790 --> 00:08:03.990
they will produce a log file

00:08:03.990 --> 00:08:05.850
that contains all the analysis commands

00:08:05.850 --> 00:08:08.490
that you applied in that analysis session.

00:08:08.490 --> 00:08:12.450
When you are done at the end,
then you save the log file,

00:08:12.450 --> 00:08:14.310
you extract the commands,

00:08:14.310 --> 00:08:16.490
you take out those that
you don't actually need

00:08:16.490 --> 00:08:20.940
for the final paper and then
when you write your paper,

00:08:20.940 --> 00:08:22.960
you store the analysis file.

00:08:22.960 --> 00:08:25.220
This is important because
you need to be able

00:08:25.220 --> 00:08:27.120
to replicate your analysis later.

00:08:27.120 --> 00:08:30.200
If someone asks you, how
did you get your results?

00:08:30.200 --> 00:08:32.340
Unless you have the analysis file,

00:08:32.340 --> 00:08:34.340
then you can't repeat your analysis.

00:08:34.340 --> 00:08:38.380
If a reviewer wants to have
changes in your analysis,

00:08:38.380 --> 00:08:40.030
when you sub into a journal,

00:08:40.030 --> 00:08:43.160
then how are you supposed to do that

00:08:43.160 --> 00:08:44.480
if you have not kept track

00:08:44.480 --> 00:08:46.630
of what you actually did to the data.

00:08:46.630 --> 00:08:47.610
On this course,

00:08:47.610 --> 00:08:49.650
whenever you return a
data analysis assignment,

00:08:49.650 --> 00:08:52.750
you must return a report and
an analysis file as well.

00:08:52.750 --> 00:08:55.220
And this is very important
because,

00:08:55.220 --> 00:08:59.830
I can point you to many examples
where researchers clearly

00:08:59.830 --> 00:09:01.690
have not stored their results

00:09:01.690 --> 00:09:03.600
and when you ask them about the results,

00:09:03.600 --> 00:09:05.560
they have no idea how
they did the calculation,

00:09:05.560 --> 00:09:07.700
because they could have done it a year ago

00:09:07.700 --> 00:09:09.170
and then they have forgotten.

00:09:09.170 --> 00:09:10.800
Keeping an analysis file ensures

00:09:10.800 --> 00:09:13.710
that you can always tell
a person who wants to know

00:09:13.710 --> 00:09:15.863
about the research,
how exactly you did it.

00:09:17.160 --> 00:09:20.890
So analysis file is one
way of storing the sequence

00:09:20.890 --> 00:09:21.723
of analysis,

00:09:21.723 --> 00:09:24.250
but there are basically
three different ways

00:09:24.250 --> 00:09:25.130
of using this software.

00:09:25.130 --> 00:09:26.734
So we have first menus,

00:09:26.734 --> 00:09:30.700
so you can generate or
do commands using menus.

00:09:30.700 --> 00:09:33.440
So you point and click and you choose

00:09:33.440 --> 00:09:35.430
from the menu regression analysis,

00:09:35.430 --> 00:09:36.760
then you have a list of variables,

00:09:36.760 --> 00:09:38.530
you choose one to be the dependent,

00:09:38.530 --> 00:09:40.410
a couple to be the independents,

00:09:40.410 --> 00:09:41.990
and then you click execute.

00:09:41.990 --> 00:09:44.180
Then the user interface of the software

00:09:44.180 --> 00:09:45.700
will generate the command,

00:09:45.700 --> 00:09:48.903
which the software will
then produce or run.

00:09:50.100 --> 00:09:51.640
R doesn't have menus,

00:09:51.640 --> 00:09:55.290
which makes using R a bit difficult,

00:09:55.290 --> 00:09:57.010
at least for the very beginners,

00:09:57.010 --> 00:09:59.790
because you need to learn
how the commands are typed

00:09:59.790 --> 00:10:00.790
in the very beginning.

00:10:00.790 --> 00:10:03.670
So R has a steep learning
curve because of that.

00:10:03.670 --> 00:10:05.090
When you open it the first time,

00:10:05.090 --> 00:10:07.070
you may have no idea what to do.

00:10:07.070 --> 00:10:09.470
When you open SPSS or Stata the first time

00:10:09.470 --> 00:10:11.740
you can always see that
there's an analysis menu,

00:10:11.740 --> 00:10:13.580
perhaps clicking on the analysis menu

00:10:13.580 --> 00:10:16.350
I can do some analysis and then
there's regression analysis,

00:10:16.350 --> 00:10:19.040
perhaps clicking on that, you
can do a regression analysis,

00:10:19.040 --> 00:10:21.240
and indeed that's the
way you do regression.

00:10:22.520 --> 00:10:27.380
Stata and R also allow you to
type commands interactively.

00:10:27.380 --> 00:10:31.540
So you can type commands
and this is the way

00:10:31.540 --> 00:10:34.760
most professional researchers
that I know use their software.

00:10:34.760 --> 00:10:37.470
Once you know the basic commands

00:10:37.470 --> 00:10:41.640
it's a lot easier to type
regress, or

00:10:41.640 --> 00:10:45.050
reg, short for regression analysis,

00:10:45.050 --> 00:10:47.530
and then type the names
of variables instead of going

00:10:47.530 --> 00:10:49.490
and clicking through the user interface.

00:10:49.490 --> 00:10:52.160
So you're a lot quicker with keyboard

00:10:52.160 --> 00:10:54.010
than you are with the menus.

00:10:54.010 --> 00:10:57.590
So this is something that, for example,

00:10:57.590 --> 00:10:58.680
Stata documentation recommends

00:10:58.680 --> 00:11:00.955
that you should start learning

00:11:00.955 --> 00:11:02.300
not as the first thing,

00:11:02.300 --> 00:11:03.310
but as the second thing

00:11:03.310 --> 00:11:05.070
when you start using the software.

00:11:05.070 --> 00:11:06.610
And then you have the analysis file.

00:11:06.610 --> 00:11:08.360
So the analysis file is just the,

00:11:08.360 --> 00:11:11.020
a sequence of commands that
reproduces all your analysis.

00:11:11.020 --> 00:11:13.810
Every time when you think
that you did something stupid,

00:11:13.810 --> 00:11:15.207
then rerun your analysis file

00:11:15.207 --> 00:11:18.570
and that gives you a clean
slate of the final analysis.

00:11:18.570 --> 00:11:20.120
That's how I use this software.

00:11:21.040 --> 00:11:23.550
Also, when we discuss

00:11:23.550 --> 00:11:25.261
which of these software is the best,

00:11:25.261 --> 00:11:27.670
one thing that you need to consider

00:11:27.670 --> 00:11:29.500
is the capabilities of the software

00:11:29.500 --> 00:11:32.430
and what does the analysis file look like

00:11:32.430 --> 00:11:34.320
because you always have
to produce that at least,

00:11:34.320 --> 00:11:37.340
regardless of how you, whether
you use menus or typing,

00:11:37.340 --> 00:11:40.930
the analysis file is something
that you will always have.

00:11:40.930 --> 00:11:42.963
So here's an analysis file example,

00:11:43.920 --> 00:11:48.920
doing the same set of
analyses in Stata and R,

00:11:49.340 --> 00:11:51.590
you don't have to understand
what this means now,

00:11:51.590 --> 00:11:52.900
but it basically,

00:11:52.900 --> 00:11:56.200
what I'm doing here is that
this is a regression analysis,

00:11:56.200 --> 00:11:58.890
so we have the regress command here

00:11:58.890 --> 00:12:01.800
or the lm command here for linear model,

00:12:01.800 --> 00:12:05.720
and I have a data set about professions

00:12:05.720 --> 00:12:09.010
I'm explaining the logarithm of income,

00:12:09.010 --> 00:12:13.280
and I'm having an interaction
term between prestige and women,

00:12:13.280 --> 00:12:15.620
I have a categorical variable here,

00:12:15.620 --> 00:12:17.740
and so this is the regression analysis.

00:12:17.740 --> 00:12:19.880
So in Stata,

00:12:19.880 --> 00:12:21.410
we create a log of income,

00:12:21.410 --> 00:12:24.560
Stata will automatically generate
an interaction term for us,

00:12:24.560 --> 00:12:27.940
it'll automatically do
categorical variables

00:12:27.940 --> 00:12:30.850
if we indicate them with the i prefix,

00:12:30.850 --> 00:12:32.960
R will automatically treat this type

00:12:32.960 --> 00:12:34.740
as a categorical variable,

00:12:34.740 --> 00:12:38.080
and then we have this regression here,

00:12:38.080 --> 00:12:41.270
interactions will always
multiply things together,

00:12:41.270 --> 00:12:44.440
R knows how to deal with
that, same with Stata.

00:12:44.440 --> 00:12:49.200
Then we have our marginal
predictions calculated here,

00:12:49.200 --> 00:12:51.350
and that's something that I will discuss

00:12:51.350 --> 00:12:53.379
on the course quite a lot
because it's a highly useful

00:12:53.379 --> 00:12:55.460
and underutilized tool.

00:12:55.460 --> 00:12:58.760
And then we plot the marginal predictions.

00:12:58.760 --> 00:13:03.760
So this is maybe one,
two, three commands

00:13:04.260 --> 00:13:07.253
to do a transformation of one variable,

00:13:07.253 --> 00:13:08.890
regression analysis,

00:13:08.890 --> 00:13:13.110
and then plotting the result
using marginal predictions.

00:13:13.110 --> 00:13:16.330
In R, we need to load a package

00:13:16.330 --> 00:13:18.040
for the marginal prediction plot,

00:13:18.040 --> 00:13:21.890
we have two, three, four, five,

00:13:21.890 --> 00:13:25.190
six commands out of which
one is loading a package,

00:13:25.190 --> 00:13:27.360
then two are just
printing out the results,

00:13:27.360 --> 00:13:29.110
the summary commands.

00:13:29.110 --> 00:13:31.680
So you only type a small number of commands

00:13:31.680 --> 00:13:34.733
for a fairly impressive set of things.

00:13:36.900 --> 00:13:38.790
In SPSS,

00:13:38.790 --> 00:13:40.790
this is the regression part.

00:13:40.790 --> 00:13:43.500
So there's no marginal
predictions, there's no plotting.

00:13:43.500 --> 00:13:45.450
You can't do that with SPSS.

00:13:45.450 --> 00:13:48.070
So this will, with SPSS,

00:13:48.070 --> 00:13:51.200
SPSS doesn't know how to
deal with interaction terms,

00:13:51.200 --> 00:13:53.500
it doesn't know how to deal
with categorical variables

00:13:53.500 --> 00:13:54.960
in a regression analysis.

00:13:54.960 --> 00:13:58.400
So you have to dummy code manually.

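NOTE
A minimal sketch of what manual dummy coding and a manual interaction term
amount to, in Python for illustration only. The rows, the category labels, and
the helper add_manual_terms are hypothetical; only the variable names prestige,
women, and type come from the lecture's example. This is the preparation step
that R and Stata are described as doing automatically.
```python
# Hypothetical rows; real data would come from the data file.
rows = [
    {"prestige": 60.0, "women": 10.0, "type": "prof"},
    {"prestige": 40.0, "women": 50.0, "type": "wc"},
    {"prestige": 30.0, "women": 5.0, "type": "bc"},
]
def add_manual_terms(rows, reference="bc"):
    """Return copies of the rows with 0/1 indicator columns and a product term added."""
    # One indicator per category, except the reference category.
    levels = sorted({r["type"] for r in rows} - {reference})
    coded = []
    for r in rows:
        new = dict(r)  # copy, so the original data stay untouched
        for level in levels:
            new["type_" + level] = 1 if r["type"] == level else 0
        # An interaction term is the product of the two variables.
        new["prestige_x_women"] = r["prestige"] * r["women"]
        coded.append(new)
    return coded
coded = add_manual_terms(rows)
```
With k categories this creates k - 1 indicator columns, and every new column
then has to be listed by hand in the regression command.
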
00:13:58.400 --> 00:14:00.350
So doing this,

00:14:00.350 --> 00:14:03.040
if you can type that's fairly quick to do,

00:14:03.040 --> 00:14:05.710
if you do this in the
user interface, this plot,

00:14:05.710 --> 00:14:07.560
maybe it takes you 10 minutes to do,

00:14:07.560 --> 00:14:09.830
compared to just typing the variable name

00:14:09.830 --> 00:14:12.320
and allowing R to do it
automatically for you

00:14:12.320 --> 00:14:14.500
or typing i period variable name

00:14:14.500 --> 00:14:17.280
and allowing Stata to
automatically do it for you

00:14:17.280 --> 00:14:18.800
once Stata,

00:14:18.800 --> 00:14:21.560
once you tell Stata that this
is a categorical variable.

00:14:21.560 --> 00:14:24.753
So in SPSS, there is,

00:14:26.240 --> 00:14:28.360
you need to do a lot
more data manipulation

00:14:28.360 --> 00:14:29.830
before the analysis,

00:14:29.830 --> 00:14:32.950
because the analysis command
is actually, it's less capable.

00:14:32.950 --> 00:14:36.310
Also, the regression
command is fairly involved,

00:14:36.310 --> 00:14:38.160
you need to specify lots of things.

00:14:38.160 --> 00:14:41.050
It's not enough to specify
just the dependent variable

00:14:41.050 --> 00:14:42.650
and the independent variables,

00:14:42.650 --> 00:14:45.180
but you need to specify
all kinds of defaults,

00:14:45.180 --> 00:14:48.360
because for some reason
the command doesn't work

00:14:48.360 --> 00:14:52.240
with empty defaults and default
to some useful settings.

00:14:52.240 --> 00:14:54.900
And then once we have done
the regression analysis

00:14:54.900 --> 00:14:57.890
then you will need to copy
paste the results to Excel,

00:14:57.890 --> 00:15:00.377
to do the marginal predictions

00:15:00.377 --> 00:15:02.280
and to plot the marginal predictions.

00:15:02.280 --> 00:15:03.720
So, SPSS here,

00:15:03.720 --> 00:15:06.340
there's a lot of, it's more work.

00:15:06.340 --> 00:15:09.100
It's more stuff going
in the analysis file,

00:15:09.100 --> 00:15:10.470
and it does less than R.

00:15:10.470 --> 00:15:13.600
That's about half of
what these analysis files do.

00:15:13.600 --> 00:15:16.780
So which one do you think
is the most convenient

00:15:16.780 --> 00:15:18.030
to work with in the long run?

00:15:18.030 --> 00:15:19.800
Well, that's a personal preference.

00:15:19.800 --> 00:15:24.530
Some people can get
away with never editing

00:15:24.530 --> 00:15:27.430
their analysis files by hand.

00:15:27.430 --> 00:15:28.280
So instead of,

00:15:28.280 --> 00:15:30.480
they just do a command and then they take

00:15:30.480 --> 00:15:32.330
what the command is using the menus

00:15:32.330 --> 00:15:35.600
and then they copy paste
it to the analysis file.

00:15:35.600 --> 00:15:36.490
But for example,

00:15:36.490 --> 00:15:38.860
if you need to change how you code

00:15:38.860 --> 00:15:41.240
this categorical
variable, at least for me,

00:15:41.240 --> 00:15:44.890
it's a lot simpler to
just edit this syntax here

00:15:44.890 --> 00:15:48.310
instead of going and
pointing and clicking around.

00:15:48.310 --> 00:15:51.290
So the SPSS syntax
is not as user-friendly

00:15:51.290 --> 00:15:53.080
as Stata and R,

00:15:53.080 --> 00:15:54.940
but if you do not understand

00:15:54.940 --> 00:15:57.270
any of these software's basic syntaxes

00:15:57.270 --> 00:15:59.220
it's going to be fairly
impossible to know what these,

00:15:59.220 --> 00:16:00.800
what any of these does.

00:16:00.800 --> 00:16:02.270
But in Stata, it's less typing here,

00:16:02.270 --> 00:16:05.010
regress and then dependent variable,

00:16:05.010 --> 00:16:06.680
independent variable, same here.

00:16:06.680 --> 00:16:10.460
lm, dependent variable,
independent variables,

00:16:10.460 --> 00:16:12.363
compared to this SPSS syntax here.

00:16:13.360 --> 00:16:15.150
So my take on software is pretty clear.

00:16:15.150 --> 00:16:17.330
I don't think anyone should be using SPSS

00:16:17.330 --> 00:16:19.130
for serious research.

00:16:19.130 --> 00:16:24.130
If you want to be a professional
construction worker,

00:16:24.210 --> 00:16:27.870
you don't go to the closest
store and pick the cheapest drill,

00:16:27.870 --> 00:16:29.110
you go to a hardware store

00:16:29.110 --> 00:16:33.130
and pick a proper professional drill.

00:16:33.130 --> 00:16:34.160
That's the same thing here.

00:16:34.160 --> 00:16:36.470
We have different kinds of tools.

00:16:36.470 --> 00:16:38.020
SPSS is a good,

00:16:38.020 --> 00:16:40.060
it's a very good tool for getting started.

00:16:40.060 --> 00:16:43.060
So if you just want to
do the first assignment

00:16:43.060 --> 00:16:43.893
of this course

00:16:43.893 --> 00:16:46.750
and never do any quantitative
research yourself,

00:16:46.750 --> 00:16:49.020
you're gonna be fine with SPSS.

00:16:49.020 --> 00:16:51.510
If you want to do this for a living,

00:16:51.510 --> 00:16:55.050
then Stata is probably
a better choice for you.

00:16:55.050 --> 00:16:58.440
R is also something
that you could consider,

00:16:58.440 --> 00:17:00.930
but the problem is that
R is a bit technical.

00:17:00.930 --> 00:17:03.360
So if you are a very non-technical person

00:17:03.360 --> 00:17:06.720
then R may not be the right tool for you.

00:17:06.720 --> 00:17:09.730
There also are some good
reasons to use SPSS.

00:17:09.730 --> 00:17:13.940
So there are lots of very
successful researchers

00:17:13.940 --> 00:17:16.070
who use SPSS as their main tool.

00:17:16.070 --> 00:17:19.050
Their main competence is
probably something else

00:17:19.050 --> 00:17:20.380
than data analysis.

00:17:20.380 --> 00:17:22.410
So if you specialize in theory,

00:17:22.410 --> 00:17:25.090
you just need basic tools
for testing your theory,

00:17:25.090 --> 00:17:27.110
and then you have others

00:17:27.110 --> 00:17:28.850
who do the more advanced tests for you,

00:17:28.850 --> 00:17:30.540
you're gonna be fine with SPSS,

00:17:30.540 --> 00:17:35.520
but if you want to be very
good in statistical analysis

00:17:35.520 --> 00:17:36.710
and quantitative research,

00:17:36.710 --> 00:17:40.738
then SPSS is probably going to
be in your way at some point.

00:17:40.738 --> 00:17:43.160
I know quite a few people

00:17:43.160 --> 00:17:45.600
that have used SPSS in the past

00:17:45.600 --> 00:17:47.960
and have moved to Stata since,

00:17:47.960 --> 00:17:50.760
and I don't know anyone who has applied,

00:17:50.760 --> 00:17:54.290
used Stata as their main
tool and then moved to SPSS.

00:17:54.290 --> 00:17:57.260
There are some people
who move from SPSS to R

00:17:57.260 --> 00:17:59.320
but that's a pretty big leap
because the software

00:17:59.320 --> 00:18:00.560
are so different.

00:18:00.560 --> 00:18:03.522
That being said, the
use of R is increasing.

00:18:03.522 --> 00:18:05.163
On the courses that I give,

00:18:06.140 --> 00:18:09.150
R tends to be the most popular option,

00:18:09.150 --> 00:18:11.630
because you can install R anywhere

00:18:11.630 --> 00:18:13.400
and it's always available for you.

00:18:13.400 --> 00:18:16.410
Stata comes next and then perhaps SPSS.

00:18:16.410 --> 00:18:19.380
You're gonna be fine with
SPSS, but it does not,

00:18:19.380 --> 00:18:20.213
for this course,

00:18:20.213 --> 00:18:23.793
but it's just not an ideal
tool in the long run for you.

00:18:25.260 --> 00:18:26.730
So how do you get started?

00:18:26.730 --> 00:18:28.590
First you need to familiarize
yourself with the software.

00:18:28.590 --> 00:18:30.530
So you need to have an
understanding of their,

00:18:30.530 --> 00:18:33.140
of the basic feeling of
how the software looks,

00:18:33.140 --> 00:18:34.750
how it works.

00:18:34.750 --> 00:18:36.340
There are, there's first Stata,

00:18:36.340 --> 00:18:38.660
the Stata's introductory manual

00:18:38.660 --> 00:18:41.070
is a very good way of getting started.

00:18:41.070 --> 00:18:43.577
So go and do Stata's,

00:18:43.577 --> 00:18:45.510
"Introducing Stata Sample Session,"

00:18:45.510 --> 00:18:48.950
open Stata, go to help
menu, go to getting started,

00:18:48.950 --> 00:18:49.790
start working.

00:18:49.790 --> 00:18:50.740
They have,

00:18:50.740 --> 00:18:53.920
they explain how the software
can be used by typing commands

00:18:53.920 --> 00:18:56.640
by doing things from the menus.

00:18:56.640 --> 00:19:00.069
If you want to use SPSS, I
recommend that you do the same.

00:19:00.069 --> 00:19:04.380
There is a manual that you
can access from the menus

00:19:04.380 --> 00:19:06.340
and then work through chapters
one, four, seven, eight,

00:19:06.340 --> 00:19:07.560
and nine in that manual.

00:19:07.560 --> 00:19:10.190
Those are the ones that
I have for my teaching.

00:19:10.190 --> 00:19:12.860
If you want to use R
then I would recommend

00:19:12.860 --> 00:19:13.977
that you go through this,

00:19:13.977 --> 00:19:15.980
"Learn To Use R" from Computer World.

00:19:15.980 --> 00:19:17.450
And these are,

00:19:17.450 --> 00:19:18.920
these will give you roughly the idea

00:19:18.920 --> 00:19:20.700
of what these software are about.

00:19:20.700 --> 00:19:22.230
Then when you actually start learning

00:19:22.230 --> 00:19:23.270
how to use this software,

00:19:23.270 --> 00:19:25.020
then you need some other resources.

00:19:26.580 --> 00:19:28.583
I recommend some books for R,

00:19:30.660 --> 00:19:33.637
there's "R In Action"
and "R for Data Science."

00:19:33.637 --> 00:19:36.368
"R In Action" is a bit
more old fashioned R,

00:19:36.368 --> 00:19:40.130
and "R for Data Science" is
a more modern take for R.

00:19:40.130 --> 00:19:42.160
The problem with "R for Data Science"

00:19:42.160 --> 00:19:45.330
is that this book goes to...

00:19:46.281 --> 00:19:49.360
It gets to pretty advanced
stuff pretty quickly.

00:19:49.360 --> 00:19:51.157
So "R In Action" is more basic,

00:19:51.157 --> 00:19:52.350
"R for Data Science"

00:19:52.350 --> 00:19:54.630
is something that you should
definitely read at some point,

00:19:54.630 --> 00:19:57.173
if you want to be an efficient user of R.

00:19:58.340 --> 00:19:59.423
For SPSS,

00:20:00.450 --> 00:20:03.760
I recommend "Discovering
Statistics Using SPSS,"

00:20:03.760 --> 00:20:05.130
that's a pretty good book.

00:20:05.130 --> 00:20:08.020
The same person also has a book about R,

00:20:08.020 --> 00:20:11.330
if I can remember correctly,
and that may be a good book.

00:20:11.330 --> 00:20:12.910
I haven't read it myself.

00:20:12.910 --> 00:20:14.010
Then for Stata,

00:20:14.010 --> 00:20:16.680
I recommend that you start
reading the "Stata User Manual"

00:20:16.680 --> 00:20:19.543
because that "Stata User
Manual" is pretty excellent.

00:20:20.409 --> 00:20:22.670
Then search for online examples,

00:20:22.670 --> 00:20:24.500
there are lots of websites that tell you

00:20:24.500 --> 00:20:26.980
how to do certain analyses
in R, SPSS, and Stata,

00:20:26.980 --> 00:20:28.563
and then you can compare.

00:20:29.830 --> 00:20:31.650
Ask for help online.

00:20:31.650 --> 00:20:34.100
For example, this course's
data analysis forum.

00:20:34.100 --> 00:20:35.950
If you have a problem ask there,

00:20:35.950 --> 00:20:37.720
come to the computer lab.

00:20:37.720 --> 00:20:39.530
And for R specifically,

00:20:39.530 --> 00:20:41.120
there are some really good online courses.

00:20:41.120 --> 00:20:41.953
For example,

00:20:41.953 --> 00:20:45.370
DataCamp has an interactive course,

00:20:45.370 --> 00:20:47.020
it takes you a couple of hours to do,

00:20:47.020 --> 00:20:48.920
it teaches you the basics of R,

00:20:48.920 --> 00:20:51.520
so you use R in a web browser there,

00:20:51.520 --> 00:20:52.900
the course tells you what to do

00:20:52.900 --> 00:20:54.710
then you do it, and after you succeed,

00:20:54.710 --> 00:20:56.360
then it tells you the next thing.

00:20:57.510 --> 00:20:58.840
One of my favorite resources

00:20:58.840 --> 00:21:01.670
for learning how to get things
done with these software

00:21:01.670 --> 00:21:04.550
is the University of
California, Los Angeles,

00:21:04.550 --> 00:21:07.360
data analysis examples website.

00:21:07.360 --> 00:21:08.367
So they are the,

00:21:09.699 --> 00:21:14.100
the link is here, and this
is an excellent source for,

00:21:14.100 --> 00:21:17.410
because you can compare how
certain things are accomplished

00:21:17.410 --> 00:21:18.610
with different software.