WEBVTT
Kind: captions
Language: en

00:00:00.030 --> 00:00:04.500
Logistic regression analysis is a commonly
used tool for binary dependent variables.

00:00:04.500 --> 00:00:10.170
A binary variable is a variable that takes
the values 1 and 0, and it's very commonly

00:00:10.170 --> 00:00:14.640
used for decisions that are either yes
or no: whether something happens or not.

00:00:14.640 --> 00:00:20.370
Whether a company decides to expand
internationally or whether it decides

00:00:20.370 --> 00:00:24.390
to stay in the home markets, whether a
person is sick or not, and that kind of data.

00:00:24.390 --> 00:00:30.540
To illustrate the logistic regression
analysis technique, we need to have some

00:00:30.540 --> 00:00:34.050
example data, and this example
data is about girls from Warsaw.

00:00:34.050 --> 00:00:40.710
And the girls range in age from about 10 years to
about 18 years, and the dependent variable

00:00:40.710 --> 00:00:45.360
here is called menarche, and that's whether
the girl has had her first period or not.

00:00:45.360 --> 00:00:51.570
So we can see here that girls at the age of
10 normally have not had their first period,

00:00:51.570 --> 00:00:56.010
and then by the time girls are 18, pretty
much everyone has had their first period.

00:00:56.010 --> 00:01:02.730
And we want to explain this relationship between
age and menarche using regression analysis.

00:01:02.730 --> 00:01:08.130
There are a couple of problems when 
we apply normal regression analysis. 

00:01:08.130 --> 00:01:17.970
For this kind of data set, the first problem
is that the regression line here goes over 1.

00:01:17.970 --> 00:01:23.730
So the value here, the regression 
line gives the expected value of  

00:01:23.730 --> 00:01:28.920
the dependent variable given age.
And in this case, because the dependent

00:01:28.920 --> 00:01:36.360
variable is zeros and ones, the expected value is
the expected probability of having had menarche.

00:01:36.360 --> 00:01:42.870
When we draw the line, we have a
problem here because the predicted

00:01:42.870 --> 00:01:48.210
probability for girls that are 18 exceeds
1, and probabilities are bounded between 0 and 1.

00:01:48.210 --> 00:01:55.320
Also, we have negative probabilities here.
This also causes a problem for regression

00:01:55.320 --> 00:02:02.340
analysis, because when we have
small fitted values here, all residuals

00:02:02.340 --> 00:02:09.150
are positive, so the error term can't be
independent of the fitted value.

00:02:09.150 --> 00:02:13.510
So in regression analysis we are violating
the no endogeneity assumption at least,

00:02:13.510 --> 00:02:20.440
and the predictions don't make any sense.
So using a linear model for this kind of data

00:02:20.440 --> 00:02:25.990
is problematic for these two reasons. 
Using this kind of linear model would  

00:02:25.990 --> 00:02:32.470
be acceptable if most girls were around
here, so the linear approximation would be

00:02:32.470 --> 00:02:37.660
okay, because it doesn't really predict any
negative values, because we can't go beyond

00:02:37.660 --> 00:02:42.910
the range of the data. But if we have negative
predictions or predictions that exceed one within

00:02:42.910 --> 00:02:48.280
the range of the data, then we have problems.
This model is called the linear probability model,

00:02:48.280 --> 00:02:53.410
and it can be used, but there
are typically better alternatives.

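NOTE
Illustrative sketch (not part of the spoken lecture): the linear probability model described above can be tried out in a few lines of Python. The data are synthetic stand-ins, since the Warsaw data are not reproduced here, and the S-shaped "truth" with midpoint 13.5 and steepness 2.0 is an assumption made only for illustration. It shows that a least squares line fitted to a 0/1 outcome can predict values below 0 and above 1 inside the observed age range.
```python
import numpy as np
#
rng = np.random.default_rng(0)
# Synthetic stand-in for the example data (assumed values, not the real data set)
age = rng.uniform(10, 18, size=4000)
p_true = 1 / (1 + np.exp(-2.0 * (age - 13.5)))  # assumed S-shaped relationship
menarche = (rng.uniform(size=age.size) < p_true).astype(float)
#
# Linear probability model: ordinary least squares of the 0/1 outcome on age
slope, intercept = np.polyfit(age, menarche, deg=1)
pred_10 = intercept + slope * 10.0
pred_18 = intercept + slope * 18.0
print(f"predicted probability at age 10: {pred_10:.2f}")
print(f"predicted probability at age 18: {pred_18:.2f}")
```
With these assumed settings the fitted line should fall below 0 near age 10 and rise above 1 near age 18, which is the problem the lecture points out.
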
00:02:53.410 --> 00:02:59.650
To start
discovering better alternatives, we need to

00:02:59.650 --> 00:03:05.410
think about what the relationship is like, and
we can do a nonparametric analysis; for example,

00:03:05.410 --> 00:03:11.170
we take a rolling average of the data.
So the idea of a rolling average is that we

00:03:11.170 --> 00:03:18.940
have here about 4,000 girls, and then we take the
first 500 here and calculate the mean for these

00:03:18.940 --> 00:03:25.840
first 500, and then we mark a small dot here.
The average for these girls is zero because no

00:03:25.840 --> 00:03:33.610
one has had menarche. Then we shift this window
to the right a bit and check the next 500 girls, so we

00:03:33.610 --> 00:03:40.900
go from the second girl to the 501st girl, like
that; we calculate the average and mark it here.

00:03:40.900 --> 00:03:48.520
Then we go from the third girl to the 502nd girl,
and we calculate the average for that subsample.

00:03:48.520 --> 00:03:53.590
Then we continue. When we go here, we can
see that the mean value is about 50%,

00:03:53.590 --> 00:04:00.130
and finally we calculate the mean
for all possible windows.

00:04:00.130 --> 00:04:04.300
We get this kind of nonparametric
curve. It's nonparametric because we

00:04:04.300 --> 00:04:10.240
can't express this curve as a simple function.
We can see that this is an s-shaped curve. 

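NOTE
Illustrative sketch (not part of the spoken lecture): the rolling average procedure just described, with a window of 500 observations shifted one girl at a time, can be written as follows. The data are synthetic (an assumption, as are the midpoint 13.5 and steepness 2.0 used to generate them); only the window size of 500 and the sample size of about 4,000 come from the lecture.
```python
import numpy as np
#
rng = np.random.default_rng(0)
# Synthetic stand-in data, sorted by age so that windows cover neighbouring ages
age = np.sort(rng.uniform(10, 18, size=4000))
p_true = 1 / (1 + np.exp(-2.0 * (age - 13.5)))  # assumed S-shaped relationship
menarche = (rng.uniform(size=age.size) < p_true).astype(float)
#
window = 500
kernel = np.ones(window) / window
# mode="valid" keeps only full windows: girls 1-500, 2-501, 3-502, and so on
rolling_mean = np.convolve(menarche, kernel, mode="valid")
rolling_age = np.convolve(age, kernel, mode="valid")
print(len(rolling_mean))
print(rolling_mean[0], rolling_mean[-1])
```
Plotting rolling_mean against rolling_age should give an S-shaped curve that starts near 0 and flattens out near 1.
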
00:04:10.240 --> 00:04:16.300
So first, when girls get a little bit older,
some girls start to have menarche, but not

00:04:16.300 --> 00:04:22.090
many. And once you hit about 13 or 14, then
the rate of having menarche increases

00:04:22.090 --> 00:04:27.790
rapidly until the increase starts to slow down when
you are at about 15, when pretty much

00:04:27.790 --> 00:04:34.840
everyone has had menarche, with a couple of
exceptions. And then it flattens out at one.

00:04:34.840 --> 00:04:42.820
This curve is called a logistic curve.
So here is the logistic curve, and the idea

00:04:42.820 --> 00:04:47.800
of logistic regression analysis is that instead
of fitting a line, we fit this logistic curve,

00:04:47.800 --> 00:04:52.540
the logit curve. The interpretation
of the result stays the same, so the line

00:04:52.540 --> 00:04:58.840
gives us the expected probability of a girl 
having had menarche given their age. But this  

00:04:58.840 --> 00:05:03.340
line, as we saw from the previous
slide, is a much better fit for the data.

00:05:03.340 --> 00:05:09.520
So the relationship in the data is not linear;
rather, it follows an S shape, and the logit

00:05:09.520 --> 00:05:13.990
curve is one such S-shaped curve that we
could use, and it's very commonly used.

00:05:13.990 --> 00:05:19.420
So we get the probability of having had 
menarche given the age from the model. 

00:05:19.420 --> 00:05:25.390
The model can be expressed mathematically 
because all models are just equations and  

00:05:25.390 --> 00:05:29.950
the mathematical expression for this
logistic regression model is as follows.

00:05:29.950 --> 00:05:34.210
First you have the linear regression model.
So that's the linear probability model because  

00:05:34.210 --> 00:05:40.720
we have a binary dependent variable, and
the logistic

00:05:40.720 --> 00:05:46.420
model extends the normal regression model
by taking a function of this fitted value.

00:05:46.420 --> 00:05:51.250
So we calculate the linear prediction
using the observed data, and then

00:05:51.250 --> 00:05:58.030
we take a function here, which gives
us the logit curve.

00:05:58.030 --> 00:06:02.290
The inverse of this function is called the 
link function and that's the logit function. 

00:06:02.290 --> 00:06:07.120
That this is the inverse, or whether it's called
an inverse function or a function, doesn't matter.

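NOTE
Illustrative sketch (not part of the spoken lecture): the relationship between the linear prediction, the logistic function, and the logit link can be checked numerically. The coefficients b0 and b1 below are hypothetical values chosen only for illustration; they are not estimates from the example data.
```python
import numpy as np
#
def logistic(eta):
    """Inverse link: maps a linear prediction to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))
#
def logit(p):
    """Link function: maps a probability to the log-odds scale."""
    return np.log(p / (1.0 - p))
#
# Hypothetical coefficients (assumed for illustration, not fitted values)
b0, b1 = -27.0, 2.0
age = np.array([10.0, 13.5, 18.0])
eta = b0 + b1 * age      # linear prediction, unbounded
prob = logistic(eta)     # S-shaped curve, always between 0 and 1
print(prob)
print(np.allclose(logit(prob), eta))
```
Applying logit to the probabilities recovers the linear prediction, which is why it makes little practical difference whether one talks about the function or its inverse.
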
00:06:07.120 --> 00:06:11.770
The important thing for you to understand
is that instead of using the predictions

00:06:11.770 --> 00:06:17.440
directly, we apply a function to the
predictions, a function that

00:06:17.440 --> 00:06:25.060
transforms the predictions from a line 
to a curve. Okay, so how do we estimate  

00:06:25.060 --> 00:06:33.520
the model? We can apply OLS estimation. So we
apply OLS estimation, then we do diagnostics.

00:06:33.520 --> 00:06:43.240
So we get the residuals here; there's a residual,
so we can calculate it, and then we can plot

00:06:43.240 --> 00:06:47.800
residual versus fitted, which is one of the
standard diagnostic plots, and then we can

00:06:47.800 --> 00:06:53.380
check the normality of the residuals. We have 
two violations of regression assumptions. First  

00:06:53.380 --> 00:07:00.040
of all, the residuals are not normally
distributed, but that's not really a big deal.

00:07:00.040 --> 00:07:06.430
It's only relevant in very small samples.
Then we have a heteroscedasticity problem,

00:07:06.430 --> 00:07:12.100
because the variation of the residuals 
here is a lot higher than the variation  

00:07:12.100 --> 00:07:16.900
here, because the variance is based on the square
of the difference, the square of the residual.

00:07:16.900 --> 00:07:24.190
So we have a heteroscedasticity
problem. We are in violation of

00:07:24.190 --> 00:07:31.360
the MLR 5 and MLR 6 assumptions.
Whether that's a big deal or not, we could

00:07:31.360 --> 00:07:36.880
use robust standard errors, but there are also some
computational difficulties when we try to apply

00:07:36.880 --> 00:07:43.630
a least squares approach to this kind of problem.
And because of those computational difficulties,

00:07:43.630 --> 00:07:47.620
and because OLS is not ideal anyway
because of the violation of these assumptions,

00:07:47.620 --> 00:07:54.280
we estimate this using a different
approach called maximum likelihood estimation.