The regression analysis basically draws a line through the data, and the line is defined by the regression coefficients, the betas in the model. Our task now is to figure out how we estimate those betas. So we give the regression analysis some data on the dependent variable and one or more independent variables, and the regression analysis tells us where the best line goes. The line is defined by the betas. How does the analysis know which line is the best?

To answer that question, we'll be looking at some example data. This same data set is used in one of the assignments, and it comes from the census of Canada from the early 70s. The data set is called Prestige, and the observations here are occupations, 102 of them. We have data for education, which is the mean number of years of education that incumbents of that occupation hold; the average income for that occupation; how many women there are, from 0 to 100%, in that occupation; and the prestige of the occupation on a prestige score that is defined in some way that we don't really care about. We also have a census code, which is an identifier that we don't need, and type, which is a categorical variable that can be either white collar, blue collar, or professional.
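As a rough sketch of the structure just described, here is what rows of such a data set look like. The variable names follow the Prestige documentation, but the occupation names and all numbers below are made up for illustration, not the real census values:

```python
# Hypothetical rows shaped like the Prestige data set: one row per
# occupation, with education (mean years), income (average dollars),
# women (percentage, 0-100), prestige (a prestige score), census
# (identifier), and type (wc = white collar, bc = blue collar,
# prof = professional). All values here are invented placeholders.
prestige_sample = [
    # occupation, education, income, women, prestige, census, type
    ("clerk.a",        12.0,   4000,  80.0,     40.0,   1111, "wc"),
    ("trade.b",         8.0,   6000,   5.0,     35.0,   2222, "bc"),
    ("prof.c",         16.0,   9000,  30.0,     70.0,   3333, "prof"),
]

for occ, edu, inc, women, pres, census, typ in prestige_sample:
    print(f"{occ:8s} education={edu:5.1f} prestige={pres:5.1f} type={typ}")
```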
Then there's some information about where the data comes from; this is a printout from the documentation of the R package car, which contains this data set. So we will be doing a regression estimation, and our task is to explain prestige with education: how much does the prestige of an occupation depend on the amount of education, in years, that is required for that occupation?

In our regression model, we said that prestige is a weighted sum of beta 0, the intercept, or the base level for an occupation requiring no education, plus beta 1, which is the effect of education, plus some variation that the model doesn't explain, the error term u. Our task is to estimate beta 0 and beta 1, which define the regression line. Estimates in statistical analysis are usually denoted by drawing this kind of caret or hat symbol over the beta. So this is beta hat 0, and this is beta hat 1; they are estimates of this population regression model. The hat denotes that we don't know a value, but we have calculated the value from a sample, and that serves as an estimate of what the relationship is in the population.

Now we need to have a rule to set the line. We have drawn the regression line here, and the regression line should go through the middle of the observations, so that there are about the same number of observations above the line and below the line.
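Written out (using i to index the occupations), the population model and the fitted line it is estimated by are:

```latex
% Population regression model: prestige as a linear function of education,
% with intercept beta_0, slope beta_1, and unexplained error u_i
\mathrm{prestige}_i = \beta_0 + \beta_1 \,\mathrm{education}_i + u_i

% Fitted regression line: hats mark quantities estimated from the sample
\widehat{\mathrm{prestige}}_i = \hat{\beta}_0 + \hat{\beta}_1 \,\mathrm{education}_i
```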
The observations are also assumed to be normally distributed around the line, so that most observations are clustered closer to the line and some are further from the line. Telling a person to draw a line through the middle is easy, and the person will probably draw a line like that. But you can't tell a computer to draw a line through the middle, because "through the middle" is not well defined. You have to have a specific rule for how to draw the line, and that specific rule of estimation is called an estimator. So an estimator is any rule, strategy, or principle that we can apply to calculate values for the quantities of interest from sample data.

Let's take a look at some properties of good estimators. We covered this in another video before, but let's revise. First, we need estimates that are consistent. Consistency means that when we have the full population data, our estimates beta hat 0 and beta hat 1 are equal to beta 0 and beta 1; the estimates will be the population values. In other words, if we have the full data, a consistent estimator gives us the correct result for that population. Unbiasedness is another important property. It means that even if we don't have the full data set, or a large sample, our estimates will be correct on average if we repeat the study over and over.
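A small simulation (an assumed setup, not from the lecture) illustrates what "correct on average" means: we draw many samples from a model whose true slope we chose ourselves, estimate the slope in each sample, and check that the average of the estimates lands on the true value.

```python
import random

random.seed(0)

TRUE_B0, TRUE_B1 = 2.0, 0.5  # hypothetical population coefficients

def ols_slope(xs, ys):
    """Least-squares slope estimate: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Repeat the "study" many times: new sample, new estimate each time.
estimates = []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(30)]
    ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0, 1) for x in xs]
    estimates.append(ols_slope(xs, ys))

mean_b1 = sum(estimates) / len(estimates)
print(round(mean_b1, 2))  # close to the true slope 0.5 on average
```

Any single estimate misses the true slope a little, but the mean over repeated samples does not drift away from it; that absence of systematic drift is unbiasedness.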
So the estimates being correct on average is unbiasedness. Then we have efficiency, which means that the estimates are more precise, or more accurate, than those of any possible alternative estimator. Efficiency is a property that we can use to compare two estimators that are both unbiased. Finally, repeated estimates from repeated samples should be normally distributed, or at least follow a known distribution. That is important for statistical inference, or calculating the p-values.

One really good rule for estimating the regression model, actually the best rule, is to use the residuals. When we have a regression line here, we can see that the observations are not exactly on the line; instead they are somewhere around the line. The line is the prediction: this is the amount of income that would be predicted based on your education level. The difference between the actual income and the income predicted by the model is called the residual. So that is the part of the dependent variable that the model doesn't explain. We can calculate this regression line by plugging in our estimates, beta hat 0 plus beta hat 1 times education. That gives the line, and then whatever remains, the difference between the line and the observation, is the residual.
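The residual is simply the observed value minus the fitted value on the line. A minimal sketch, using hypothetical coefficient estimates and made-up data points:

```python
# Hypothetical fitted coefficients for a prestige ~ education line
# (illustrative values, not estimates from the real data).
b0_hat, b1_hat = -10.0, 5.4

education = [8.0, 11.0, 15.0]   # observed x values
prestige  = [35.0, 42.0, 73.0]  # observed y values

# Fitted value = point on the line; residual = observed - fitted.
fitted    = [b0_hat + b1_hat * x for x in education]
residuals = [y - yhat for y, yhat in zip(prestige, fitted)]

for x, y, yhat, e in zip(education, prestige, fitted, residuals):
    print(f"education={x:4.1f} observed={y:5.1f} "
          f"fitted={yhat:5.1f} residual={e:+6.1f}")
```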
So the best rule for estimating this regression model is to find the line so that the sum of these residuals raised to the second power is as small as possible. How do we do it in practice? We set the line somewhere, we calculate the residual for each observation, we raise each residual to the second power, we take the sum, and then we try different values for the betas to make the sum of squared residuals as small as possible. This is called the ordinary least squares estimator, and it has been proven to be consistent, unbiased, and efficient, and it produces normally distributed estimates under assumptions that we will cover later.
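In practice nobody searches over beta values by trial and error: for this simple model the minimization of the sum of squared residuals has a closed-form solution. A self-contained sketch with illustrative data (not the real Prestige values):

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = b0 + b1*x.

    Minimizing sum((y - b0 - b1*x)**2) over b0 and b1 yields
    b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Data generated from an exact line y = 3 + 2x: OLS should recover it,
# and the minimized sum of squared residuals should be zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]
b0, b1 = ols_fit(xs, ys)
ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(b0, b1, ssr)  # intercept 3, slope 2, SSR 0
```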