WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:04.500 Logistic regression analysis is a commonly used tool for binary dependent variables. 00:00:04.500 --> 00:00:10.170 A binary variable is a variable that receives the values of 1 and 0, and it's very commonly 00:00:10.170 --> 00:00:14.640 used for decisions that are either yes or no, whether something happens or not: 00:00:14.640 --> 00:00:20.370 whether a company decides to expand internationally or whether it decides 00:00:20.370 --> 00:00:24.390 to stay in the home markets, whether a person is sick or not, and that kind of data. 00:00:24.390 --> 00:00:30.540 To illustrate the logistic regression analysis technique we need to have some 00:00:30.540 --> 00:00:34.050 example data, and this example data is about girls from Warsaw. 00:00:34.050 --> 00:00:40.710 The girls range from about 10 years to about 18 years, and the dependent variable 00:00:40.710 --> 00:00:45.360 here is called menarche, and that's whether the girl has had the first period or not. 00:00:45.360 --> 00:00:51.570 So we can see here that girls at the age of 10 normally haven't had the first period, 00:00:51.570 --> 00:00:56.010 and when girls are 18, pretty much everyone has had the first period. 00:00:56.010 --> 00:01:02.730 And we want to explain this relationship between age and menarche using regression analysis. 00:01:02.730 --> 00:01:08.130 There are a couple of problems when we apply normal regression analysis 00:01:08.130 --> 00:01:17.970 to this kind of data set. The first problem is that the regression line here goes over 1. 00:01:17.970 --> 00:01:23.730 So the regression line gives the expected value of 00:01:23.730 --> 00:01:28.920 the dependent variable given age. And in this case, because the dependent 00:01:28.920 --> 00:01:36.360 variable is 0 or 1, the expected value is the expected probability of having had menarche.
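The point about the regression line going over 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (not the Warsaw data), and the S-shaped curve used to generate them (midpoint 13.5 years, steepness 2.0) is invented for the example. With these assumed values, the fitted straight line should fall below 0 at age 10 and rise above 1 at age 18.

```python
import math
import random

random.seed(1)

# Synthetic stand-in for the lecture data (NOT the Warsaw data): 4,000 ages
# between 10 and 18, with a 0/1 outcome drawn from an assumed S-shaped curve.
ages = [random.uniform(10, 18) for _ in range(4000)]
had_menarche = [
    1 if random.random() < 1 / (1 + math.exp(-2.0 * (a - 13.5))) else 0
    for a in ages
]

# Ordinary least squares with one predictor: slope = cov(x, y) / var(x).
n = len(ages)
mean_age = sum(ages) / n
mean_y = sum(had_menarche) / n
cov_xy = sum((a - mean_age) * (v - mean_y) for a, v in zip(ages, had_menarche))
var_x = sum((a - mean_age) ** 2 for a in ages)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_age


def predict(age):
    """Fitted value of the linear probability model at a given age."""
    return intercept + slope * age


# Print the "predicted probability" at the edges and the middle of the age range.
for age in (10, 14, 18):
    print(age, round(predict(age), 3))
```

Because the outcome is 0 or 1, these fitted values are meant to be probabilities, which is why values outside the 0 to 1 range are a problem.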
00:01:36.360 --> 00:01:42.870 When we draw the line, we have a problem here because the predicted 00:01:42.870 --> 00:01:48.210 probability for girls that are 18 exceeds 1, and probabilities are bounded between 0 and 1. 00:01:48.210 --> 00:01:55.320 Also, we have negative probabilities here. This also causes a problem for regression 00:01:55.320 --> 00:02:02.340 analysis, because when we have small fitted values here, then all residuals 00:02:02.340 --> 00:02:09.150 are positive, so the error term can't be independent of the fitted value. 00:02:09.150 --> 00:02:13.510 So in regression analysis we are violating the no endogeneity assumption at least, 00:02:13.510 --> 00:02:20.440 and the predictions don't make any sense. So using a linear model for this kind of data 00:02:20.440 --> 00:02:25.990 is problematic for these two reasons. Using this kind of linear model would 00:02:25.990 --> 00:02:32.470 be acceptable if most girls were around here, so the linear approximation would be 00:02:32.470 --> 00:02:37.660 okay because it doesn't really predict any negative values, because we can't go beyond 00:02:37.660 --> 00:02:42.910 the range of the data. But if we have negative predictions or predictions that exceed 1 within 00:02:42.910 --> 00:02:48.280 the range of the data, then we have problems. This model is called the linear probability model, 00:02:48.280 --> 00:02:53.410 and it can be used, but there are typically better alternatives. 00:02:53.410 --> 00:02:59.650 To start discovering better alternatives we need to 00:02:59.650 --> 00:03:05.410 think about what the relationship is like, and we can do a nonparametric analysis. For example, 00:03:05.410 --> 00:03:11.170 we take a rolling average from the data.
So the idea of a rolling average is that we 00:03:11.170 --> 00:03:18.940 have here about 4,000 girls, and then we take the first 500 here, we calculate the mean for these 00:03:18.940 --> 00:03:25.840 first 500, and then we mark a small dot here. The average for these girls is zero because no 00:03:25.840 --> 00:03:33.610 one has had menarche. Then we shift this window right a bit, we check the next 500 girls, so we 00:03:33.610 --> 00:03:40.900 go from the second girl to the 501st girl, we calculate the average, and we mark it here. 00:03:40.900 --> 00:03:48.520 Then we go from the third girl to the 502nd girl and we calculate the average for that subsample. 00:03:48.520 --> 00:03:53.590 Then we continue. When we get here, we can see that the mean value is about 50%, 00:03:53.590 --> 00:04:00.130 and finally, when we have calculated the mean for all possible windows, 00:04:00.130 --> 00:04:04.300 we get this kind of nonparametric curve. It's nonparametric because we 00:04:04.300 --> 00:04:10.240 can't express this curve as a simple function. We can see that this is an S-shaped curve. 00:04:10.240 --> 00:04:16.300 So first, when girls get a little bit older, some girls start to have menarche, but not 00:04:16.300 --> 00:04:22.090 many. And once you hit about 13 to 14, the rate of having menarche increases 00:04:22.090 --> 00:04:27.790 rapidly until it starts to decrease at about 15, when pretty much 00:04:27.790 --> 00:04:34.840 everyone has had menarche except for a couple of exceptions. And then it flattens out at one. 00:04:34.840 --> 00:04:42.820 This curve is called a logistic curve. So here is the logistic curve, and the idea 00:04:42.820 --> 00:04:47.800 of logistic regression analysis is that instead of fitting a line we fit this logistic curve.
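The rolling-average procedure described above (sort by age, average the first 500 outcomes, shift the window by one girl, repeat) can be sketched as follows. The data are again synthetic and the generating curve is an assumption made for the example; only the window size of 500 and the sample size of about 4,000 come from the lecture. The helper name `rolling_mean` is hypothetical.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for the lecture data (NOT the Warsaw data), sorted by age
# so that each window covers girls of similar age.
ages = sorted(random.uniform(10, 18) for _ in range(4000))
had_menarche = [
    1 if random.random() < 1 / (1 + math.exp(-2.0 * (a - 13.5))) else 0
    for a in ages
]


def rolling_mean(values, window):
    """Mean of every run of `window` consecutive values, sliding by one."""
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


# One point of the nonparametric curve per window position.
curve = rolling_mean(had_menarche, 500)
print(len(curve), curve[0], curve[len(curve) // 2], curve[-1])
```

Plotting `curve` against the average age in each window would give the S-shaped nonparametric curve the lecture refers to.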
00:04:47.800 --> 00:04:52.540 With the logit curve, the interpretation of the result stays the same, so the line 00:04:52.540 --> 00:04:58.840 gives us the expected probability of a girl having had menarche given her age. But this 00:04:58.840 --> 00:05:03.340 line, as we saw from the previous slide, is a much better fit for the data. 00:05:03.340 --> 00:05:09.520 So in the data the relationship is not linear; rather it follows an S shape, and the logit 00:05:09.520 --> 00:05:13.990 curve is one such S-shaped curve that we could use, and it's very commonly used. 00:05:13.990 --> 00:05:19.420 So we get the probability of having had menarche given the age from the model. 00:05:19.420 --> 00:05:25.390 The model can be expressed mathematically, because all models are just equations, and 00:05:25.390 --> 00:05:29.950 the mathematical expression for this logistic regression model is as follows. 00:05:29.950 --> 00:05:34.210 First you have the linear regression model. That's the linear probability model, because 00:05:34.210 --> 00:05:40.720 we have a binary dependent variable, and the logistic 00:05:40.720 --> 00:05:46.420 model extends the normal regression model by taking a function of this fitted value. 00:05:46.420 --> 00:05:51.250 So we calculate the linear prediction using the observed data, and then 00:05:51.250 --> 00:05:58.030 we take a function here, which gives us the logit curve. 00:05:58.030 --> 00:06:02.290 The inverse of this function is called the link function, and that's the logit function. 00:06:02.290 --> 00:06:07.120 Whether it's called an inverse function or a function doesn't matter.
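The two functions mentioned above can be written out in a short sketch. The logistic function turns a linear prediction into a probability between 0 and 1, and the logit (the link function) undoes it. The coefficients `b0` and `b1` are hypothetical values picked for illustration, not estimates from the lecture data.

```python
import math


def logistic(eta):
    """Map a linear prediction eta to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Link function: map a probability back to the linear-prediction scale."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -27.0, 2.0

for age in (10, 12, 13.5, 15, 18):
    eta = b0 + b1 * age   # linear prediction, as in ordinary regression
    p = logistic(eta)     # probability read off the S-shaped curve
    print(age, round(p, 3))
```

The linear prediction can be any real number, but the transformed value always stays between 0 and 1, which is what removes the out-of-range predictions of the linear probability model.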
00:06:07.120 --> 00:06:11.770 The important thing for you to understand is that instead of using the predictions 00:06:11.770 --> 00:06:17.440 directly, we apply a function to the predictions that 00:06:17.440 --> 00:06:25.060 transforms the predictions from a line to a curve. Okay, so how do we estimate 00:06:25.060 --> 00:06:33.520 the model? We can apply OLS estimation. So we apply OLS estimation, then we do diagnostics. 00:06:33.520 --> 00:06:43.240 So we get the residuals here, there's a residual, so we can calculate it, then we can plot 00:06:43.240 --> 00:06:47.800 residual versus fitted, which is one of the standard diagnostic plots, and then we can 00:06:47.800 --> 00:06:53.380 check the normality of the residuals. We have two violations of regression assumptions. First 00:06:53.380 --> 00:07:00.040 of all, the residuals are not normally distributed, but that's not really a big deal. 00:07:00.040 --> 00:07:06.430 It's only relevant in very small samples. Then we have a heteroscedasticity problem, 00:07:06.430 --> 00:07:12.100 because the variation of the residuals here is a lot higher than the variation 00:07:12.100 --> 00:07:16.900 here, because the variance is the square of the difference, the square of the residual. 00:07:16.900 --> 00:07:24.190 So we have a heteroscedasticity problem. We are in violation of 00:07:24.190 --> 00:07:31.360 the MLR 5 and MLR 6 assumptions. Whether that's a big deal or not, we could 00:07:31.360 --> 00:07:36.880 use robust standard errors, but there are also some computational difficulties when we try to apply 00:07:36.880 --> 00:07:43.630 the least squares approach to this kind of problem. And because of those computational difficulties, 00:07:43.630 --> 00:07:47.620 and because OLS is not ideal anyway because of the violation of these assumptions, 00:07:47.620 --> 00:07:54.280 we estimate this using a different approach called maximum likelihood estimation.
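The heteroscedasticity point can be checked with a small calculation. It relies on a standard property of a 0/1 variable that the lecture does not spell out: if the fitted probability is p, the residual is 1 - p with probability p and -p with probability 1 - p, so its variance is p(1 - p). The function name below is hypothetical.

```python
def residual_variance(p):
    """Variance of the residual of a 0/1 outcome whose fitted probability is p."""
    # E[residual^2] = p * (1 - p)^2 + (1 - p) * p^2, which simplifies to p * (1 - p).
    return p * (1.0 - p)


# Residual variance at several fitted probabilities.
for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(p, residual_variance(p))
```

The variance is largest where the fitted probability is near 0.5 and shrinks toward 0 near the ends of the curve, so the residual variance cannot be constant across fitted values.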