WEBVTT

00:00:00.060 --> 00:00:06.720
Another feature that people typically check
in their data is the presence of outliers.

00:00:06.720 --> 00:00:10.950
Outliers are influential
observations, or observations

00:00:10.950 --> 00:00:12.660
that are clearly different from other observations.

00:00:12.660 --> 00:00:17.520
While the absence of outliers

00:00:17.520 --> 00:00:19.590
is not an assumption in regression analysis,

00:00:19.590 --> 00:00:22.860
there are sometimes reasons to delete them.

00:00:22.860 --> 00:00:25.170
So you have to understand why you have an outlier.

00:00:25.170 --> 00:00:26.970
Let's take a look at outliers.

00:00:26.970 --> 00:00:29.850
Here we have the Prestige dataset.

00:00:29.850 --> 00:00:34.590
We have a regression line 
of the effect of education  

00:00:34.590 --> 00:00:39.090
on prestige, and it is a nice, clean regression line.

00:00:39.090 --> 00:00:43.680
The observations are homoscedastic and they
are spread evenly around the regression line.

00:00:44.730 --> 00:00:45.510
And there are no problems.
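
A minimal sketch of the regression just described, fitting prestige on education with statsmodels; the file name and column names are assumptions, not from the lecture (the data resemble the Prestige dataset distributed with R's car package):

```python
# Sketch: fit the prestige-on-education regression described above.
# "prestige.csv" and its column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

prestige = pd.read_csv("prestige.csv")
model = smf.ols("prestige ~ education", data=prestige).fit()
print(model.summary())  # intercept, slope, and fit statistics
```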

00:00:45.510 --> 00:00:51.630
What happens if we have one observation
that is very far from the others?

00:00:51.630 --> 00:00:55.020
We have an outlier here, so
what will that outlier do?

00:00:55.020 --> 00:01:01.980
The outlier will pull the regression
line toward itself, and now, with the

00:01:01.980 --> 00:01:07.770
outlier included in the data, the
slope of the regression line is a bit smaller.

00:01:07.770 --> 00:01:12.630
Also, the line no longer goes through the
middle of the remaining observations;

00:01:12.630 --> 00:01:16.620
rather, it goes too low here and too high here.

00:01:16.620 --> 00:01:21.840
So clearly, we do not
want the outlier here.
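
To see the pull concretely, here is a sketch on simulated data: fit a clean line, then refit after injecting a single extreme point and compare the slopes. All numbers are made up for illustration:

```python
# Sketch: one extreme observation flattens the slope of an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(6, 16, 50)                 # e.g., years of education
y = 10 + 4 * x + rng.normal(0, 5, 50)      # clean linear relationship

clean = sm.OLS(y, sm.add_constant(x)).fit()

# Inject one outlier: high education, very low outcome.
x_out = np.append(x, 15.5)
y_out = np.append(y, 17.0)
contaminated = sm.OLS(y_out, sm.add_constant(x_out)).fit()

print("slope without outlier:", clean.params[1])
print("slope with outlier:   ", contaminated.params[1])  # noticeably smaller
```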

00:01:21.840 --> 00:01:27.270
But before we decide what to do with the outlier,
we have to consider the different mechanisms that produce outliers.

00:01:27.270 --> 00:01:29.460
So what is this observation really about?

00:01:29.460 --> 00:01:33.540
And it could be that it's a data entry mistake.

00:01:33.540 --> 00:01:41.760
So the occupation's prestige should really
be 70, but somebody entered 17 into our dataset.

00:01:41.760 --> 00:01:45.120
Or, if these were companies, it is possible

00:01:45.120 --> 00:01:50.640
that the outlier is a company
that is outside of our population.

00:01:50.640 --> 00:01:56.400
If we do a survey of small technology
companies, then we could accidentally

00:01:56.400 --> 00:01:59.640
send the survey to a large technology company.

00:01:59.640 --> 00:02:04.290
The large technology company would be outside

00:02:04.290 --> 00:02:08.550
of our population, so it should
not be part of our sample.

00:02:08.550 --> 00:02:12.360
Or it could be a case that is truly unique.

00:02:12.360 --> 00:02:17.430
If we're studying the growth of 
small technology-based companies,  

00:02:17.430 --> 00:02:24.210
then, for example, Supercell, the Finnish
game developer that makes billions

00:02:24.210 --> 00:02:28.950
of euros of revenue from games
on the App Store, is an outlier.

00:02:28.950 --> 00:02:33.720
While they technically are a
small and young technology-based company,

00:02:33.720 --> 00:02:37.350
they are so different from the other
companies in their performance

00:02:37.350 --> 00:02:43.710
that, because a regression model typically
aims to explain the bulk of the data,

00:02:43.710 --> 00:02:45.240
that is, where most of the observations are,

00:02:45.240 --> 00:02:51.420
including that particular outlier is
something that we probably do not want to do.

00:02:51.420 --> 00:02:55.980
So outliers could be observations
that are truly unique,

00:02:55.980 --> 00:02:59.460
which may be worth studying
separately as case studies;

00:02:59.460 --> 00:03:05.070
they could be data entry mistakes;
or they could be observations

00:03:05.070 --> 00:03:08.700
that do not belong to our population
and were included in the sample accidentally.

00:03:08.700 --> 00:03:12.030
The effects of an outlier depend
on two different things.

00:03:12.030 --> 00:03:14.850
So, first, we have the residual.

00:03:14.850 --> 00:03:16.680
How far is the outlier from the regression line?

00:03:16.680 --> 00:03:22.320
The outlier pulls the regression
line toward itself, and the strength

00:03:22.320 --> 00:03:26.340
of that pull is related to the residual.

00:03:26.340 --> 00:03:28.170
So we want to minimize the 
sum of squared residuals.

00:03:28.710 --> 00:03:35.610
If one observation has a very large residual,
then it pulls the regression

00:03:35.610 --> 00:03:38.790
line very strongly, because it is the square
of the residual that matters.
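
Written out, the estimation criterion the lecture refers to is:

```latex
% OLS picks the coefficients that minimize the sum of squared residuals:
\[
\min_{\beta_0,\,\beta_1} \; \sum_{i=1}^{n} e_i^2
  \;=\; \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \beta_1 x_i \bigr)^2
\]
% Because residuals enter squared, a single large residual contributes
% disproportionately, which is why one extreme point pulls the line so hard.
```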

00:03:38.790 --> 00:03:44.880
Another concept is leverage. If
we pull the regression line here,

00:03:44.880 --> 00:03:46.410
where there are few observations,

00:03:46.410 --> 00:03:51.630
then we have a lot more leverage
and the regression line moves more

00:03:51.630 --> 00:03:55.560
than if we pull it from the middle here 
where there are lots of observations.

00:03:55.560 --> 00:04:01.800
So pulling the regression line from here has
practically zero leverage, and the outlier would not really matter.

00:04:02.550 --> 00:04:07.110
We check leverage and residuals
when we do outlier diagnostics.
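
A sketch of those diagnostics with statsmodels, assuming "model" is the fitted OLS result from the earlier sketch; the cutoffs are common rules of thumb, not values from the lecture:

```python
# Sketch: leverage (hat values), studentized residuals, and Cook's distance,
# which combines the two sources of influence discussed above.
import numpy as np

influence = model.get_influence()
leverage = influence.hat_matrix_diag                  # how far x is from its mean
student_resid = influence.resid_studentized_external  # residuals on a t scale
cooks_d = influence.cooks_distance[0]                 # residual and leverage combined

# Flag cases that are extreme on any of the three measures.
n = int(model.nobs)
p = int(model.df_model) + 1                           # parameters incl. intercept
flagged = np.where(
    (np.abs(student_resid) > 3) | (leverage > 2 * p / n) | (cooks_d > 4 / n)
)[0]
print("observations to inspect:", flagged)
```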

00:04:09.450 --> 00:04:13.980
When we identify outliers, there are
three important steps in the process.

00:04:13.980 --> 00:04:17.940
Deephouse's article is a really great
example of how to deal with outliers.

00:04:17.940 --> 00:04:24.990
First, you report how you identified the
outliers; Deephouse used residuals.

00:04:24.990 --> 00:04:29.520
They identified the
banks with large residuals;

00:04:29.520 --> 00:04:33.090
then they analyzed the outliers.

00:04:33.090 --> 00:04:37.110
So what is the outlier like:
is it a data entry mistake,

00:04:37.110 --> 00:04:42.420
is it a company that should not be in the sample,
or is it a unique case that is not

00:04:42.420 --> 00:04:47.790
representative of the other banks,
even if it technically belongs to the population?

00:04:47.790 --> 00:04:52.230
They identified that there were 
two banks that were merging.

00:04:52.230 --> 00:04:56.640
And if you have banks that
are merging, then that is

00:04:56.640 --> 00:04:59.100
probably quite a different observation from the others.

00:04:59.100 --> 00:05:02.760
And they decided to drop that 
observation from the sample.

00:05:02.760 --> 00:05:03.960
So that's the third step. 

00:05:03.960 --> 00:05:08.280
Explain what you did and
what the outcome of doing so was.

00:05:08.280 --> 00:05:11.880
They explained what the
effect of dropping the outlier was,

00:05:11.880 --> 00:05:14.400
and they concluded that it did not really make

00:05:14.400 --> 00:05:18.330
a difference whether they included
that observation in the sample or not.

00:05:18.330 --> 00:05:19.740
And that's a very good example.
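
Putting the three steps together, a sketch of the workflow the Deephouse example illustrates, building on the fitted "model" and "prestige" data frame from the earlier sketches (the residual cutoff is an assumption):

```python
# Sketch: (1) identify outliers via studentized residuals, (2) inspect them,
# (3) refit without them and report whether the conclusions change.
import numpy as np
import statsmodels.formula.api as smf

resid = model.get_influence().resid_studentized_external

# Step 1: identify observations with large residuals.
suspects = prestige[np.abs(resid) > 3]

# Step 2: analyze them by hand -- data entry mistake, outside the population,
# or a genuinely unique case? This part is judgment, not code.
print(suspects)

# Step 3: refit without the suspects and compare the estimates.
refit = smf.ols("prestige ~ education",
                data=prestige.drop(suspects.index)).fit()
print("slope with all data:   ", model.params["education"])
print("slope without suspects:", refit.params["education"])
```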

00:05:21.000 --> 00:05:24.870
If you want to read more about
outliers and good practices,

00:05:24.870 --> 00:05:27.810
I recommend this paper by 
Aguinis and his students.

00:05:27.810 --> 00:05:32.610
They describe how you can identify
outliers in regression analysis,

00:05:32.610 --> 00:05:36.630
structural equation models, and multilevel models,

00:05:36.630 --> 00:05:38.490
and how you can deal with those outliers.

00:05:38.490 --> 00:05:44.160
Sometimes outliers are problematic; sometimes
they are data entry mistakes, which can be fixed.

00:05:44.160 --> 00:05:49.980
Sometimes outliers are truly interesting 
cases that you should study separately.

00:05:49.980 --> 00:05:54.480
Yeah so that's what the Deephouse paper did.