Another feature that people typically check in their data is the presence of outliers. Outliers are influential observations, or observations that are clearly different from the other observations. While the absence of outliers is not an assumption of regression analysis, there are sometimes reasons to delete them, so you have to understand why you have an outlier.

Let's take a look at outliers. Here we have the Prestige data set, with a regression line for the effect of education on prestige, and it is a nice, clean regression line: the observations are homoscedastic, they are spread evenly around the regression line, and there are no problems.

What happens if we have one observation that is very far from the others? We have an outlier, so what will that outlier do? The outlier pulls the regression line toward itself. With the outlier included in the data, the slope is a bit smaller, and the line no longer goes through the middle of the remaining observations; instead it runs too low at one end and too high at the other. So we clearly don't want the outlier there.

But before we decide what to do with an outlier, we have to consider the different mechanisms that can produce one: what is this observation really about? It could be a data entry mistake: the occupation's prestige really should be 70, but somebody entered 17 into our data set. It could be an observation from outside our population: if we do a survey of small technology companies, we can accidentally send the survey to a large technology company, which is outside our population and therefore should not be part of our sample. Or it could be a case that is genuinely unique.
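The pull described above is easy to reproduce. Below is a minimal sketch in Python; the simulated data, the variable names, the mistyped value, and the use of statsmodels are all illustrative assumptions, not the lecture's actual Prestige data. Fitting the same line with and without one extreme point shows the slope shrinking:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Clean, homoscedastic data: prestige rises with education.
education = rng.uniform(8, 16, 50)
prestige = 10 + 4 * education + rng.normal(0, 4, 50)

# One outlier: a prestige score of 70 mistyped as 17,
# at a high education level (hypothetical example).
education_out = np.append(education, 16.0)
prestige_out = np.append(prestige, 17.0)

for label, x, y in [("without outlier", education, prestige),
                    ("with outlier   ", education_out, prestige_out)]:
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"{label}: slope = {fit.params[1]:.2f}")
```

The single mistyped point attenuates the slope and shifts the intercept, which is exactly the tilt visible in the plot.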
If we're studying the growth of small technology-based companies, then, for example, Supercell, the Finnish game developer that makes billions of euros of revenue from games on the App Store, is an outlier. While Supercell technically is a small and young technology-based company, it is so different from the other companies in its performance that we probably don't want to include it: a regression model typically aims to explain the bulk of the data, where most of the observations are.

So outliers can be observations that are truly unique and may be worth studying separately as case studies, they can be data entry mistakes, or they can be observations that don't belong to our population and were included in the sample accidentally.

The effect of an outlier depends on two things. The first is the residual: how far the outlier is from the regression line. The outlier pulls the regression line toward itself, and the strength of that pull is related to the residual. Because least squares minimizes the sum of squared residuals, an observation with a very large residual pulls the regression line very strongly; it is the square of the residual that matters. The second concept is leverage: if we pull the regression line near its ends, where there are few observations, we have much more leverage and the line moves more than if we pull it from the middle, where there are lots of observations. Pulling the regression line at the mean of the explanatory variable has essentially zero leverage, and the outlier wouldn't really matter. We check both leverage and residuals when we do outlier diagnostics.
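A sketch of how those two diagnostics can be computed, continuing the simulated example above (statsmodels is an assumed tool here; the lecture does not prescribe software). Its get_influence() helper exposes leverage, studentized residuals, and Cook's distance, which combines the two:

```python
# Continuing the sketch above, with the outlier included in the fit.
import numpy as np
import statsmodels.api as sm

fit = sm.OLS(prestige_out, sm.add_constant(education_out)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag                  # distance from the bulk of x
student_resid = influence.resid_studentized_external  # residual size, outlier-robust
cooks_d, _ = influence.cooks_distance                 # combines residual and leverage

# Flag observations with an externally studentized residual beyond +/- 3
# (one common rule of thumb; thresholds vary by field).
flagged = np.where(np.abs(student_resid) > 3)[0]
for i in flagged:
    print(f"obs {i}: leverage={leverage[i]:.3f}, "
          f"studentized residual={student_resid[i]:.2f}, "
          f"Cook's D={cooks_d[i]:.2f}")
```

An observation needs both a large residual and high leverage to move the fit much, which is why the diagnostics are examined together.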
When we identify outliers, there are three important steps in the process, and Deephouse's article is a really great example of how to deal with outliers. First, report how you identified the outliers: Deephouse used residuals and identified banks with large residuals. Second, analyze the outliers: is it a data entry mistake, is it a company that shouldn't be in the sample, or is it a unique case that is not representative of the other banks even though it technically belongs to the population? Deephouse found two banks that were merging, and merging banks are probably quite different observations from the others, so that case was dropped from the sample. The third step is to explain what you did and what the outcome of doing so was. Deephouse reported the effect of dropping the outlier and concluded that it didn't really make a difference whether that observation was included in the sample or not. That's a very good example.

If you want to read more about outliers and good practices, I recommend the paper by Aguinis and his students. They explain how to identify outliers in regression analysis, structural regression models, and multilevel models, and how you can deal with them. Sometimes outliers are problematic; sometimes they are data entry mistakes, which can be fixed; and sometimes they are truly interesting cases that you should study separately. So that's what the Deephouse paper did.
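As a closing sketch of that third reporting step, continuing the simulated example from the earlier snippets (an illustration only, not Deephouse's actual data or model): refit without the flagged observations and report whether the estimates change.

```python
# Continuing the sketches above: refit without the flagged observations
# and report the effect of excluding them.
import numpy as np
import statsmodels.api as sm

keep = np.setdiff1d(np.arange(len(prestige_out)), flagged)

full = sm.OLS(prestige_out, sm.add_constant(education_out)).fit()
refit = sm.OLS(prestige_out[keep], sm.add_constant(education_out[keep])).fit()

print(f"slope with outliers included: {full.params[1]:.2f}")
print(f"slope with outliers excluded: {refit.params[1]:.2f}")
# If the two estimates barely differ, report that the exclusion did not
# materially change the conclusions; if they differ, report both results.
```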