WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:04.500
Logistic regression analysis is a commonly
used tool for binary dependent variables.
00:00:04.500 --> 00:00:10.170
A binary variable is a variable that takes
the values 1 and 0, and it's very commonly
00:00:10.170 --> 00:00:14.640
used for decisions that are either yes
or no, whether something happens or not.
00:00:14.640 --> 00:00:20.370
Whether a company decides to expand
internationally or whether it decides
00:00:20.370 --> 00:00:24.390
to stay in the home market, whether a
person is sick or not, and that kind of data.
00:00:24.390 --> 00:00:30.540
To illustrate the logistic regression
analysis technique, we need to have some
00:00:30.540 --> 00:00:34.050
example data, and this example
data is about girls from Warsaw.
00:00:34.050 --> 00:00:40.710
The girls range in age from about 10 years to
about 18 years, and the dependent variable
00:00:40.710 --> 00:00:45.360
here is called menarche, and that's whether
the girl has had her first period or not.
00:00:45.360 --> 00:00:51.570
So we can see here that girls at the age of
10 normally haven't had their first period,
00:00:51.570 --> 00:00:56.010
and by the time girls are 18, pretty
much everyone has had their first period.
00:00:56.010 --> 00:01:02.730
And we want to explain this relationship between
age and menarche using regression analysis.
00:01:02.730 --> 00:01:08.130
There are a couple of problems when
we apply normal regression analysis
00:01:08.130 --> 00:01:17.970
to this kind of data set. The first problem
is that the regression line here goes over 1.
00:01:17.970 --> 00:01:23.730
So the value here, the regression
line gives the expected value of
00:01:23.730 --> 00:01:28.920
the dependent variable given age.
And in this case because the dependent
00:01:28.920 --> 00:01:36.360
variable is 0 or 1, the expected value is
the expected probability of having had menarche.
00:01:36.360 --> 00:01:42.870
When we draw the line, we have a
problem here because the predicted
00:01:42.870 --> 00:01:48.210
probability for girls who are 18 exceeds
1, and probabilities are bounded between 0 and 1.
00:01:48.210 --> 00:01:55.320
Also, we have negative probabilities here.
This also causes a problem for regression
00:01:55.320 --> 00:02:02.340
analysis because when we have small
fitted values here, then all residuals
00:02:02.340 --> 00:02:09.150
are positive, so the error term can't be
independent of the fitted value.
00:02:09.150 --> 00:02:13.510
So in regression analysis we are violating
the no endogeneity assumption at least,
00:02:13.510 --> 00:02:20.440
and the predictions don't make any sense.
So using a linear model for this kind of data
00:02:20.440 --> 00:02:25.990
is problematic for these two reasons.
Using this kind of linear model would
00:02:25.990 --> 00:02:32.470
be acceptable if most girls were around
here, so the linear approximation would be
00:02:32.470 --> 00:02:37.660
okay, because it doesn't really predict any
negative values, as we can't go beyond
00:02:37.660 --> 00:02:42.910
the range of the data. But if we have negative
predictions or predictions that exceed one within
00:02:42.910 --> 00:02:48.280
the range of the data, then we have problems.
This model is called the linear probability model,
00:02:48.280 --> 00:02:53.410
and it can be used, but there
are typically better alternatives.
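NOTE
A minimal sketch of the linear probability model problem described above, fitting an ordinary least squares line to binary outcomes. The counts are hypothetical, invented for illustration, and are not the Warsaw data from the lecture. The helper name fit_line is also an assumption.

```python
# Hypothetical counts, invented for illustration (not the lecture's data):
# 20 girls at each age, and how many of them have had menarche.
ages = [10, 11, 12, 13, 14, 15, 16, 17, 18]
n_per_age = 20
n_menarche = [0, 0, 1, 6, 14, 19, 20, 20, 20]

# Expand the counts into one (age, 0/1 outcome) pair per girl.
data = []
for age, k in zip(ages, n_menarche):
    data += [(age, 1)] * k + [(age, 0)] * (n_per_age - k)


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(data)
pred_at_10 = intercept + slope * 10   # fitted "probability" at age 10
pred_at_18 = intercept + slope * 18   # fitted "probability" at age 18
print(f"predicted probability at age 10: {pred_at_10:.3f}")
print(f"predicted probability at age 18: {pred_at_18:.3f}")
```

With these made-up counts the fitted line drops below 0 at age 10 and rises above 1 at age 18, which is the problem the lecture points out.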
00:02:53.410 --> 00:02:59.650
To start
discovering better alternatives, we need to
00:02:59.650 --> 00:03:05.410
think about what the relationship is like, and
we can do a nonparametric analysis. For example,
00:03:05.410 --> 00:03:11.170
we can take a rolling average of the data.
The idea of a rolling average is that we
00:03:11.170 --> 00:03:18.940
have here about 4,000 girls, and we take the
first 500 here, calculate the mean for these
00:03:18.940 --> 00:03:25.840
first 500, and then we mark a small dot here.
The average for these girls is zero because no
00:03:25.840 --> 00:03:33.610
one has had menarche. Then we shift this window
to the right a bit and check the next 500 girls, so we
00:03:33.610 --> 00:03:40.900
go from the second girl to the 501st girl.
We calculate the average and mark it here.
00:03:40.900 --> 00:03:48.520
Then we go from the third girl to the 502nd girl
and calculate the average for that subsample.
00:03:48.520 --> 00:03:53.590
Then we continue. When we get here, we can
see that the mean value is about 50%,
00:03:53.590 --> 00:04:00.130
and finally, when we have calculated the mean
for all possible windows,
00:04:00.130 --> 00:04:04.300
we get this kind of nonparametric
curve. It's nonparametric because we
00:04:04.300 --> 00:04:10.240
can't express this curve as a simple function.
We can see that this is an s-shaped curve.
00:04:10.240 --> 00:04:16.300
So first, when girls get a little bit older,
some girls start to have menarche, but not
00:04:16.300 --> 00:04:22.090
many. And once you hit about 13 or 14, then
the rate of having menarche increases
00:04:22.090 --> 00:04:27.790
rapidly until the increase starts to slow
when you are at about 15, when pretty much
00:04:27.790 --> 00:04:34.840
everyone has had menarche, with a couple of
exceptions. And then it flattens out at one.
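NOTE
A minimal sketch of the rolling average described above: sort the girls by age, take the mean outcome of the first window, shift the window by one girl, and repeat. The lecture uses a window of 500 over about 4,000 girls. The tiny outcome list and the window of 4 below are hypothetical, and rolling_mean is an assumed helper name.

```python
def rolling_mean(values, window):
    """Mean of every run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest.
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


# Hypothetical outcomes, already sorted by age (0 = not yet, 1 = has had menarche).
outcomes = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
curve = rolling_mean(outcomes, window=4)
print(curve)
```

Each value in curve is one dot on the nonparametric curve: the share of girls in that window who have had menarche.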
00:04:34.840 --> 00:04:42.820
This curve is called a logistic curve.
So here is the logistic curve, and the idea
00:04:42.820 --> 00:04:47.800
of logistic regression analysis is that instead
of fitting a line we fit this logistic curve.
00:04:47.800 --> 00:04:52.540
The interpretation of the result
stays the same, so the curve
00:04:52.540 --> 00:04:58.840
gives us the expected probability of a girl
having had menarche given her age. But this
00:04:58.840 --> 00:05:03.340
curve, as we saw from the previous
slide, is a much better fit for the data.
00:05:03.340 --> 00:05:09.520
So in the data the relationship is not linear;
rather, it follows an S shape, and the logistic
00:05:09.520 --> 00:05:13.990
curve is one such S-shaped curve that we
could use, and it's very commonly used.
00:05:13.990 --> 00:05:19.420
So we get the probability of having had
menarche given the age from the model.
00:05:19.420 --> 00:05:25.390
The model can be expressed mathematically
because all models are just equations and
00:05:25.390 --> 00:05:29.950
the mathematical expression for this
logistic regression model is as follows.
00:05:29.950 --> 00:05:34.210
First you have the linear regression model.
So that's the linear probability model because
00:05:34.210 --> 00:05:40.720
we have one binary dependent variable, and
the logistic regression
00:05:40.720 --> 00:05:46.420
model extends the normal regression model
by taking a function of this fitted value.
00:05:46.420 --> 00:05:51.250
So we calculate the linear prediction
using the observed data, and then
00:05:51.250 --> 00:05:58.030
we take a function here, which gives
us the logistic curve.
00:05:58.030 --> 00:06:02.290
The inverse of this function is called the
link function and that's the logit function.
00:06:02.290 --> 00:06:07.120
Whether this is the inverse, that is, whether it's called
an inverse function or a function, doesn't matter.
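NOTE
A minimal sketch of the two functions just described: the logistic function, which turns a linear prediction into a probability, and its inverse, the logit link function. The coefficients b0 and b1 are hypothetical, chosen only so that the curve crosses 0.5 at age 13. They are not estimates from the lecture.

```python
import math


def logistic(z):
    """Inverse link: map a linear prediction z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Link function: map a probability p in (0, 1) to the log odds."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only so the curve crosses 0.5 at age 13.
b0, b1 = -19.5, 1.5
for age in (10, 13, 18):
    z = b0 + b1 * age   # linear prediction, can be any real number
    print(age, round(logistic(z), 3))
```

The linear prediction z is unbounded, but logistic(z) always stays strictly between 0 and 1, and logit(logistic(z)) recovers z.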
00:06:07.120 --> 00:06:11.770
The important thing for you to understand
is that instead of using the predictions
00:06:11.770 --> 00:06:17.440
directly, we apply a function to the
predictions, a function that
00:06:17.440 --> 00:06:25.060
transforms them from a line
to a curve. Okay, so how do we estimate
to a curve. Okay, so how do we estimate
00:06:25.060 --> 00:06:33.520
the model? We can apply OLS estimation. So we
apply OLS estimation, and then we do diagnostics.
00:06:33.520 --> 00:06:43.240
So we get the residuals here. There's a residual,
so we can calculate it, and then we can plot
00:06:43.240 --> 00:06:47.800
residuals versus fitted values, which is one of the
standard diagnostic plots, and then we can
00:06:47.800 --> 00:06:53.380
check the normality of the residuals. We have
two violations of regression assumptions. First
00:06:53.380 --> 00:07:00.040
of all, the residuals are not normally
distributed, but that's not really a big deal.
00:07:00.040 --> 00:07:06.430
It's only relevant in very small samples.
Then we have a heteroscedasticity problem,
00:07:06.430 --> 00:07:12.100
because the variation of the residuals
here is a lot higher than the variation
00:07:12.100 --> 00:07:16.900
here because the variance is the square
of the difference, square of the residual.
00:07:16.900 --> 00:07:24.190
So we have a heteroscedasticity
problem. We are in violation of
00:07:24.190 --> 00:07:31.360
the MLR 5 and MLR 6 assumptions.
Whether that's a big deal or not we could
00:07:31.360 --> 00:07:36.880
use robust standard errors, but there are also some
computational difficulties when we try to apply
00:07:36.880 --> 00:07:43.630
a least squares approach to this kind of problem.
And because of those computational difficulties
00:07:43.630 --> 00:07:47.620
and because OLS is not ideal anyway
because of the violation of these assumptions,
00:07:47.620 --> 00:07:54.280
we estimate this using a different
approach called maximum likelihood estimation.
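NOTE
A rough sketch of what maximum likelihood estimation of the logistic model could look like. The lecture only names the approach here, so the algorithm (Newton's method on the binomial log likelihood) is an assumption chosen for illustration, and the counts are hypothetical, not the Warsaw data. The names fit_logistic and log_likelihood are likewise assumed.

```python
import math

# Hypothetical counts, invented for illustration (not the lecture's data).
ages = [10, 11, 12, 13, 14, 15, 16, 17, 18]
n_per_age = 20
n_menarche = [0, 0, 1, 6, 14, 19, 20, 20, 20]
mean_age = sum(ages) / len(ages)   # centring age keeps the numbers well scaled


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1):
    """Binomial log likelihood of the counts under the logistic model."""
    total = 0.0
    for age, k in zip(ages, n_menarche):
        z = b0 + b1 * (age - mean_age)
        log_p = -math.log1p(math.exp(-z))      # log of P(menarche)
        log_q = -math.log1p(math.exp(z))       # log of P(no menarche)
        total += k * log_p + (n_per_age - k) * log_q
    return total


def fit_logistic(iterations=25):
    """Maximise the log likelihood with Newton's method (two coefficients)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for age, k in zip(ages, n_menarche):
            x = age - mean_age
            p = logistic(b0 + b1 * x)
            resid = k - n_per_age * p          # observed minus expected count
            w = n_per_age * p * (1.0 - p)      # binomial variance at this age
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = gradient, 2x2 case.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic()
age_at_half = mean_age - b0 / b1   # age where the fitted probability is 0.5
print(f"slope per year of age: {b1:.3f}")
print(f"age at which P(menarche) = 0.5: {age_at_half:.2f}")
```

Unlike the least squares line, the fitted probabilities from this model stay between 0 and 1 at every age.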