WEBVTT
Kind: captions
Language: en

00:00:00.060 --> 00:00:04.980
In this video I will explain the suppression effect in regression analysis.

00:00:04.980 --> 00:00:11.070
The suppression effect is a term that is used for a feature of regression analysis.

00:00:11.070 --> 00:00:15.360
You don't actually have to understand what the term suppression means,

00:00:15.360 --> 00:00:20.850
but you have to understand why certain results sometimes occur in regression analysis.

00:00:20.850 --> 00:00:25.410
You will basically need the term suppression only

00:00:25.410 --> 00:00:30.270
if a reviewer argues that you should explain suppression.

00:00:30.270 --> 00:00:34.500
So I don't think there is any valid reason to discuss

00:00:34.500 --> 00:00:38.910
suppression in an empirical paper unless you are asked to do so.

00:00:38.910 --> 00:00:42.870
Let's take a look at Hekman's paper, because they mention suppression.

00:00:42.870 --> 00:00:49.890
They explain that in their correlation table and regression table, physician age

00:00:49.890 --> 00:00:53.430
has a different sign. So in the correlation table

00:00:53.430 --> 00:00:59.520
the correlation between physician age and patient satisfaction is positive,

00:00:59.520 --> 00:01:03.900
and in the regression results it's negative. And that's the suppression effect.

00:01:03.900 --> 00:01:12.870
The technical definition is unimportant. Then they explain that these variables may somehow

00:01:12.870 --> 00:01:17.940
be suppressing the variance of the dependent variable that is irrelevant to its prediction.

00:01:17.940 --> 00:01:22.560
I don't understand what that means, so that doesn't really have meaning.

00:01:22.560 --> 00:01:30.240
Then they cite a textbook on statistical analysis that presumably explains what they mean.

00:01:30.240 --> 00:01:34.410
Unfortunately that's a big book and they don't give a page number,

00:01:34.410 --> 00:01:38.850
so we can't really meaningfully check what that book says about suppression.

00:01:38.850 --> 00:01:43.500
So whenever you explain something and then you give the reader a book to read,

00:01:43.500 --> 00:01:50.220
then at least give the reader some indication of which chapter or which page of that book explains

00:01:50.220 --> 00:01:54.750
the fact that you're referring to. Otherwise you are

00:01:54.750 --> 00:01:59.040
making your hundreds or thousands of readers browse through this book

00:01:59.040 --> 00:02:03.270
and waste their time looking for a fact whose location you already know.

00:02:03.270 --> 00:02:06.300
Because you wouldn't be citing the book unless you had read it.

00:02:06.300 --> 00:02:13.580
Then they explain that the correlation is not sufficiently significant, and that they tried different

00:02:13.580 --> 00:02:18.020
models and the results were unchanged, and they conclude that suppression is not a problem.

00:02:18.020 --> 00:02:23.630
I agree that suppression is not a problem, but not for the reasons that they

00:02:23.630 --> 00:02:28.940
explained. The suppression effect is not something that we need to cure. It's not problematic,

00:02:28.940 --> 00:02:32.510
it's a feature of regression analysis. So let's take a look at

00:02:32.510 --> 00:02:37.430
their actual statistics. So what are the numbers that they refer to?

00:02:37.430 --> 00:02:45.650
They identified that the correlation between physician age and patient satisfaction

00:02:45.650 --> 00:02:50.870
is positive and the corresponding regression coefficient is negative.

00:02:50.870 --> 00:02:57.350
So why could that be the case? We have to remember that correlation and

00:02:57.350 --> 00:03:02.870
regression coefficient quantify different things. A regression coefficient ideally

00:03:02.870 --> 00:03:06.710
quantifies the causal relationship under certain assumptions.

00:03:06.710 --> 00:03:10.910
A correlation coefficient quantifies a linear association,

00:03:10.910 --> 00:03:18.050
which could be causal or it could be spurious. It's very simple to see here why physician

00:03:18.050 --> 00:03:23.150
age is correlated positively with satisfaction but the regression coefficient is negative.

00:03:23.150 --> 00:03:28.550
We just need to look at the correlation table. So let's take a look at the correlation table.

00:03:28.550 --> 00:03:34.790
We first look at which variables are highly correlated with age.

00:03:34.790 --> 00:03:40.520
Well, it's tenure: tenure correlates with age at a very high level.

00:03:40.520 --> 00:03:47.480
Then we look at the regression coefficient of tenure. Here it's strongly positive,

00:03:47.480 --> 00:03:53.720
so the more experience you have, the more satisfied your patients are.

00:03:53.720 --> 00:03:59.000
Also, experience correlates with age, which is quite natural, because if you

00:03:59.000 --> 00:04:05.120
are a 25-year-old, newly graduated medical doctor, you can't have much experience,

00:04:05.120 --> 00:04:11.870
and if you are someone with 30 years of work experience as a doctor, you must be more

00:04:11.870 --> 00:04:16.370
than 50, because normally you are more than 20 when you graduate from medical school.

00:04:16.370 --> 00:04:20.720
So age and tenure, that is age and work experience, naturally correlate very highly.

00:04:20.720 --> 00:04:27.560
So what's going on? Remember that the linear model implies a correlation matrix.

00:04:27.560 --> 00:04:38.450
So what is the implied correlation between age and patient satisfaction, based on the correlation

00:04:38.450 --> 00:04:45.320
between age and tenure and the effects of tenure and age? We go from age to patient satisfaction.

00:04:45.320 --> 00:04:54.770
We take that path once, -0.13, and we take the correlation path, 0.69 times

00:04:54.770 --> 00:05:00.710
0.34, this correlation path. So with some math we get

00:05:00.710 --> 00:05:06.650
that the implied correlation based on this part of the model only is 0.10,

00:05:06.650 --> 00:05:10.760
which is very close to the 0.09 which is their positive correlation.

00:05:10.760 --> 00:05:19.940
So why is there a different sign? It's pretty straightforward and we

00:05:19.940 --> 00:05:24.200
have a natural explanation. Regression coefficients

00:05:24.200 --> 00:05:30.140
quantify the effect of one variable when other variables are held constant.

00:05:30.140 --> 00:05:36.140
So what the regression coefficient tells us is that when you have two physicians that

00:05:36.140 --> 00:05:40.550
have an equal amount of work experience, people tend to prefer the younger one.

00:05:40.550 --> 00:05:44.630
That is natural. But people also tend

00:05:44.630 --> 00:05:51.680
to prefer doctors with more experience, and those doctors who are older tend to have more experience,

00:05:51.680 --> 00:05:59.300
and experience is the variable that matters more than age. So the correlation here,

00:05:59.300 --> 00:06:06.290
0.09, reflects the effect of age itself, which is negative based on this model,

00:06:06.290 --> 00:06:14.000
and a spurious effect due to the fact that those doctors who have more experience

00:06:14.000 --> 00:06:20.120
are also older and receive better scores.
So this correlation is the sum of a spurious

00:06:20.120 --> 00:06:26.690
effect and a direct effect, and in this case the spurious effect, due to the correlation with tenure

00:06:26.690 --> 00:06:32.690
and the effect of tenure, which is strong, is a lot stronger than the direct effect of age.

00:06:32.690 --> 00:06:38.780
Therefore we get a positive correlation. So that's how regression analysis works:

00:06:38.780 --> 00:06:45.500
it takes a correlation and it tries to identify how much of that correlation

00:06:45.500 --> 00:06:50.210
is spurious and how much of that correlation corresponds to a causal relationship.

00:06:50.210 --> 00:06:59.030
Sometimes the spurious part is a lot larger than the actual causal effect part, and that

00:06:59.030 --> 00:07:04.940
can cause the regression coefficient to have a different sign than the correlation coefficient.

00:07:04.940 --> 00:07:08.600
It is not a problem. It is how regression analysis works.
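
The path arithmetic walked through in the video can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three numbers (-0.13, 0.69, 0.34) are taken as heard in the video and are assumed to be standardized values, the model is simplified to just two predictors (age and tenure), and all variable and function names are hypothetical. The paper's full model has more variables, so this is not a reproduction of its results.

```python
# Values as read out in the video (assumed to be standardized).
beta_age = -0.13      # direct path: age -> satisfaction
beta_tenure = 0.34    # direct path: tenure -> satisfaction
r_age_tenure = 0.69   # correlation between age and tenure


def implied_correlations(b1, b2, r12):
    """Tracing rule for two correlated, standardized predictors.

    Each predictor's correlation with the outcome is its direct path
    plus the path that runs through the other predictor.
    """
    r_y1 = b1 + r12 * b2
    r_y2 = b2 + r12 * b1
    return r_y1, r_y2


def standardized_betas(r_y1, r_y2, r12):
    """Invert the tracing rule: standardized coefficients for two predictors."""
    denom = 1.0 - r12 ** 2
    b1 = (r_y1 - r12 * r_y2) / denom
    b2 = (r_y2 - r12 * r_y1) / denom
    return b1, b2


# Going forward: from coefficients to the implied correlations.
r_age_sat, r_tenure_sat = implied_correlations(beta_age, beta_tenure, r_age_tenure)
spurious_part = r_age_tenure * beta_tenure   # the part that runs through tenure

print("direct effect of age:      ", beta_age)
print("spurious part via tenure:  ", round(spurious_part, 4))
print("implied corr(age, satisf.):", round(r_age_sat, 4))  # about 0.10, near the reported 0.09

# Going backward: regression recovers a negative age coefficient
# from two positive correlations.
b_age, b_tenure = standardized_betas(r_age_sat, r_tenure_sat, r_age_tenure)
print("recovered coefficients:    ", round(b_age, 4), round(b_tenure, 4))
```

The positive part that runs through tenure outweighs the negative direct path, so the correlation comes out positive even though the coefficient for age is negative. Running the calculation backward shows the same thing from the regression side: given these correlations, holding tenure constant yields a negative coefficient for age.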