Another feature that people typically check in their data is the presence of outliers. Outliers are influential observations, or observations that are clearly different from the other observations. While the absence of outliers is not an assumption of regression analysis, there are sometimes reasons to delete them, so you have to understand why you have an outlier.

Let's take a look at outliers. Here we have the Prestige data set, with a regression line for the effect of education on prestige, and it is a nice, clean regression line: the observations are homoscedastic, they are spread evenly around the regression line, and there are no problems.

What happens if we have one observation that is very far from the others? We have an outlier, so what will that outlier do? The outlier pulls the regression line toward itself. With the outlier included in the data, the slope is a bit smaller, and the line no longer goes through the middle of the remaining observations; instead it runs too low at one end and too high at the other. So we clearly don't want the outlier there.

But before we decide what to do with an outlier, we have to consider the different mechanisms that can produce one: what is this observation really about? It could be a data entry mistake: the occupation's prestige really should be 70, but somebody entered 17 into our data set. It could be an observation from outside our population: if we do a survey of small technology companies, we can accidentally send the survey to a large technology company, which is outside our population and therefore should not be part of our sample. Or it could be a case that is genuinely unique.
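The pull described above is easy to reproduce. Below is a minimal sketch in Python; the simulated data, the variable names, the mistyped value, and the use of statsmodels are all illustrative assumptions, not the lecture's actual Prestige data. Fitting the same line with and without one extreme point shows the slope shrinking:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Clean, homoscedastic data: prestige rises with education.
education = rng.uniform(8, 16, 50)
prestige = 10 + 4 * education + rng.normal(0, 4, 50)

# One outlier: a prestige score of 70 mistyped as 17,
# at a high education level (hypothetical example).
education_out = np.append(education, 16.0)
prestige_out = np.append(prestige, 17.0)

for label, x, y in [("without outlier", education, prestige),
                    ("with outlier   ", education_out, prestige_out)]:
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"{label}: slope = {fit.params[1]:.2f}")
```

The single mistyped point attenuates the slope and shifts the intercept, which is exactly the tilt visible in the plot.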
If we're studying the growth of small technology-based companies, then, for example, Supercell, the Finnish game developer that makes billions of euros of revenue from games on the App Store, is an outlier. While Supercell technically is a small and young technology-based company, it is so different from the other companies in its performance that we probably don't want to include it: a regression model typically aims to explain the bulk of the data, where most of the observations are.

So outliers can be observations that are truly unique and may be worth studying separately as case studies, they can be data entry mistakes, or they can be observations that don't belong to our population and were included in the sample accidentally.

The effect of an outlier depends on two things. The first is the residual: how far the outlier is from the regression line. The outlier pulls the regression line toward itself, and the strength of that pull is related to the residual. Because least squares minimizes the sum of squared residuals, an observation with a very large residual pulls the regression line very strongly; it is the square of the residual that matters. The second concept is leverage: if we pull the regression line near its ends, where there are few observations, we have much more leverage and the line moves more than if we pull it from the middle, where there are lots of observations. Pulling the regression line at the mean of the explanatory variable has essentially zero leverage, and the outlier wouldn't really matter. We check both leverage and residuals when we do outlier diagnostics.
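A sketch of how those two diagnostics can be computed, continuing the simulated example above (statsmodels is an assumed tool here; the lecture does not prescribe software). Its get_influence() helper exposes leverage, studentized residuals, and Cook's distance, which combines the two:

```python
# Continuing the sketch above, with the outlier included in the fit.
import numpy as np
import statsmodels.api as sm

fit = sm.OLS(prestige_out, sm.add_constant(education_out)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag                  # distance from the bulk of x
student_resid = influence.resid_studentized_external  # residual size, outlier-robust
cooks_d, _ = influence.cooks_distance                 # combines residual and leverage

# Flag observations with an externally studentized residual beyond +/- 3
# (one common rule of thumb; thresholds vary by field).
flagged = np.where(np.abs(student_resid) > 3)[0]
for i in flagged:
    print(f"obs {i}: leverage={leverage[i]:.3f}, "
          f"studentized residual={student_resid[i]:.2f}, "
          f"Cook's D={cooks_d[i]:.2f}")
```

An observation needs both a large residual and high leverage to move the fit much, which is why the diagnostics are examined together.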
When we identify outliers, there are three important steps in the process, and Deephouse's article is a really great example of how to deal with outliers. First, report how you identified the outliers: Deephouse used residuals and identified banks with large residuals. Second, analyze the outliers: is it a data entry mistake, is it a company that shouldn't be in the sample, or is it a unique case that is not representative of the other banks even though it technically belongs to the population? Deephouse found two banks that were merging, and merging banks are probably quite different observations from the others, so that case was dropped from the sample. The third step is to explain what you did and what the outcome of doing so was. Deephouse reported the effect of dropping the outlier and concluded that it didn't really make a difference whether that observation was included in the sample or not. That's a very good example.

If you want to read more about outliers and good practices, I recommend the paper by Aguinis and his students. They explain how to identify outliers in regression analysis, structural regression models, and multilevel models, and how you can deal with them. Sometimes outliers are problematic; sometimes they are data entry mistakes, which can be fixed; and sometimes they are truly interesting cases that you should study separately. So that's what the Deephouse paper did.
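As a closing sketch of that third reporting step, continuing the simulated example from the earlier snippets (an illustration only, not Deephouse's actual data or model): refit without the flagged observations and report whether the estimates change.

```python
# Continuing the sketches above: refit without the flagged observations
# and report the effect of excluding them.
import numpy as np
import statsmodels.api as sm

keep = np.setdiff1d(np.arange(len(prestige_out)), flagged)

full = sm.OLS(prestige_out, sm.add_constant(education_out)).fit()
refit = sm.OLS(prestige_out[keep], sm.add_constant(education_out[keep])).fit()

print(f"slope with outliers included: {full.params[1]:.2f}")
print(f"slope with outliers excluded: {refit.params[1]:.2f}")
# If the two estimates barely differ, report that the exclusion did not
# materially change the conclusions; if they differ, report both results.
```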