WEBVTT Kind: captions Language: en 00:00:00.060 --> 00:00:06.180 We will next cover a couple of basic statistical concepts that are related to data. Descriptive 00:00:06.180 --> 00:00:13.140 statistics are numbers that we calculate from our data. They are like summaries 00:00:13.140 --> 00:00:19.860 of our data. Understanding these basic concepts is important, because quite often our more 00:00:19.860 --> 00:00:27.000 complicated models try to explain differences in means, or try to explain variation, or assume something 00:00:27.000 --> 00:00:33.570 about the variance, and so on. So to understand what these more complicated things do, we have to 00:00:33.570 --> 00:00:40.800 understand the basics. This is partly high school mathematics, but it's useful to revise it 00:00:40.800 --> 00:00:47.490 now before we go into more complicated things. The first important thing to know is the 00:00:48.690 --> 00:00:55.770 concept of central tendency and dispersion. These are data about three thousand one 00:00:55.770 --> 00:01:02.700 hundred seventy-one working-age males from the United States. We have data on their heights. 00:01:02.700 --> 00:01:10.950 This shows the distribution of the heights, so we have some people that are very short, we have some 00:01:10.950 --> 00:01:18.090 people that are very tall, and most people fall into these bins somewhere in the middle. Each 00:01:18.090 --> 00:01:26.670 bar here represents a group of people: how many people fall into a category, for example 170 00:01:26.670 --> 00:01:32.910 to 175 centimeters of height. So the bar represents the number of people. This is a histogram; 00:01:32.910 --> 00:01:40.860 it presents how these heights are distributed. Then we have the kernel density plot of 00:01:40.860 --> 00:01:47.430 the same data. The kernel density plot shows us the distribution in another way.
So we just have 00:01:47.430 --> 00:01:53.940 a line, and this is the probability density function, that's what it's called formally, 00:01:53.940 --> 00:02:00.180 and the height here tells us the relative probability of observing a person here 00:02:00.180 --> 00:02:08.250 versus a person here, for example. The area under the curve is always one, so if the scale 00:02:08.250 --> 00:02:17.430 here is in the tens, then the scale here must be 0.0-something. So that shows us that 00:02:17.430 --> 00:02:22.290 most people are in the middle, and then there are a few short people and 00:02:22.290 --> 00:02:31.140 a few tall people. Now the concept of central tendency tells us where this distribution 00:02:31.140 --> 00:02:41.580 is located. So are the people roughly distributed around 175 centimeters, or are they perhaps at about 00:02:41.580 --> 00:02:51.300 160 centimeters or 180 centimeters? So it tells us the location, which is another commonly 00:02:51.300 --> 00:03:00.030 used term for where this distribution actually is on the axis. We have two measures 00:03:00.030 --> 00:03:05.940 of central tendency that are the most important. The mean is the most commonly used. The mean is 00:03:05.940 --> 00:03:11.760 just the average: you take the sum of all these people's heights and you divide 00:03:11.760 --> 00:03:19.560 it by the number of people. Then we have the median, which is the height of a typical person. 00:03:19.560 --> 00:03:25.350 The median is calculated by putting these people in a line, so that the shortest person is in front 00:03:25.350 --> 00:03:31.560 and the tallest person is in the back, and everyone is ordered based on their height, 00:03:31.560 --> 00:03:36.630 and then you take the person who is right in the middle. So it's the middlemost observation's value.
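The two measures of central tendency described above can be sketched in a few lines of Python. This is only an illustration: the five heights are made-up values, not the lecture's data, and the variable names are hypothetical.

```python
import statistics

# Hypothetical heights in centimeters (illustration only, not the lecture's data).
heights = [162, 171, 175, 178, 189]

# Mean: the sum of all heights divided by the number of people.
mean_height = sum(heights) / len(heights)

# Median: order everyone by height and take the value of the person in the middle.
ordered = sorted(heights)
median_height = ordered[len(ordered) // 2]  # odd-sized sample, so there is one middle value

print(mean_height)    # 175.0
print(median_height)  # 175

# The standard library gives the same answers.
print(statistics.mean(heights), statistics.median(heights))
```

With an even number of observations there is no single middle person; `statistics.median` then averages the two middle values.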
00:03:36.630 --> 00:03:45.780 The median is a useful statistic for quantifying what a typical person is like in the population, 00:03:45.780 --> 00:03:52.890 because it's not sensitive to some people that are very tall or very short. For example, if we 00:03:52.890 --> 00:03:58.440 had a person here that was 1 million centimeters tall, which of course is impossible, then the mean 00:03:58.440 --> 00:04:06.000 would be affected but the median wouldn't. So the mean and the median tell us what a typical person, 00:04:06.000 --> 00:04:13.500 or a typical company, or whatever you're studying, is like. The other important concept is dispersion. 00:04:13.500 --> 00:04:21.060 Dispersion tells us how wide this distribution is: is everyone about the same size, is 00:04:21.060 --> 00:04:30.060 everyone between 174 and 176 centimeters, or are people between 150 centimeters and 2 meters? So 00:04:30.060 --> 00:04:37.380 dispersion tells us how widely these persons are spread out. The most commonly used 00:04:37.380 --> 00:04:43.230 measure of dispersion is the standard deviation. I'm not going to present you the definition, 00:04:43.230 --> 00:04:49.770 but it's important to know that plus or minus one standard 00:04:49.770 --> 00:04:57.180 deviation covers about two thirds of the data, and plus or minus two standard deviations 00:04:57.180 --> 00:05:05.250 cover about 95% of the data. So these green lines show the two standard deviations, and about 95% of 00:05:05.250 --> 00:05:14.700 the people fit into this area. So the standard deviation could be larger; 00:05:15.600 --> 00:05:21.570 here it's about 7.4 centimeters.
If it was 10 centimeters, it would mean that plus 00:05:21.570 --> 00:05:30.000 two standard deviations would be a bit less than 2 meters, and minus two standard deviations 00:05:30.000 --> 00:05:36.360 would be a bit more than 150 centimeters. So it would tell us that people's heights vary 00:05:36.360 --> 00:05:43.860 more. So the standard deviation tells us how much the observations vary. There is a joke about 00:05:43.860 --> 00:05:52.440 why the standard deviation is important. There are two statisticians, one is 150 centimeters tall and one 00:05:52.440 --> 00:06:00.540 is 160 centimeters tall, and they are crossing a river that has a mean depth of 120 centimeters, 00:06:00.540 --> 00:06:07.800 and they're debating whether they should cross or not. They decide not to, because the mean 00:06:07.800 --> 00:06:13.380 doesn't tell them how deep the deepest part is. So we also have to understand how much the 00:06:13.380 --> 00:06:18.720 depth of the river varies, instead of just knowing the average depth of 00:06:18.720 --> 00:06:23.310 the river. So the standard deviation tells us how much variation there is in the observations. 00:06:23.310 --> 00:06:31.260 Then there's the concept of standardization, which is also important. Standardization can 00:06:31.260 --> 00:06:38.460 be useful and it can be harmful depending on the context, but it's important to understand why we 00:06:38.460 --> 00:06:44.400 standardize and when. For example correlation, which I have mentioned before, is a standardized 00:06:44.400 --> 00:06:50.670 measure, so it applies standardization. The idea of standardization is that you take the observations, 00:06:50.670 --> 00:06:58.470 they are distributed like that, the mean is at about 175, and the standard deviation is 00:06:58.470 --> 00:07:06.300 about seven centimeters.
You subtract the mean from every observation, and you divide by the 00:07:06.300 --> 00:07:12.900 standard deviation. That gives you a new variable that has a mean of zero and a standard deviation 00:07:12.900 --> 00:07:22.290 of exactly one. So we are basically throwing away the data about the location and dispersion, 00:07:22.290 --> 00:07:29.190 and we are just retaining the data on where each individual is located relative to the other 00:07:29.190 --> 00:07:35.250 individuals, and we also retain the overall shape of the distribution. This can sometimes 00:07:35.250 --> 00:07:47.430 make things easier to interpret. For example, if I say that I'm 176 centimeters tall, it may tell 00:07:47.430 --> 00:07:52.110 you something about my height if you know what the heights in the population are. 00:07:52.110 --> 00:08:00.690 If I say that my height is at the mean, then everyone understands that typical Finnish 00:08:00.690 --> 00:08:07.140 males are about 50% of the time taller than me and 50% of the time shorter than 00:08:07.140 --> 00:08:12.870 me, so I'm of average height. So standardization can make things easier to interpret, but it can also 00:08:12.870 --> 00:08:18.510 make things harder to interpret, depending on the context. Standardization destroys information 00:08:18.510 --> 00:08:25.290 by eliminating the information about the central tendency, or the location, and 00:08:25.290 --> 00:08:31.620 the dispersion from the data. Then there's also variance, which is another measure of dispersion, 00:08:31.620 --> 00:08:39.540 and variance is related to the standard deviation. It's used because it's more convenient for some 00:08:39.540 --> 00:08:45.510 computations, and sometimes variance is easier to interpret.
For example, in regression analysis we 00:08:45.510 --> 00:08:52.080 assess how much of the variance of the dependent variable the model explains. We don't 00:08:52.080 --> 00:08:58.740 do that in the standard deviation metric, we do it in the variance metric. The standard deviation has the same 00:08:58.740 --> 00:09:03.900 unit as the original variable. So if the standard deviation is seven, then we know that these 00:09:03.900 --> 00:09:12.660 bars are 7 centimeters from the mean, and if we multiply this variable by 2, then the standard 00:09:12.660 --> 00:09:19.890 deviation doubles, so that's convenient. The variance measures the same thing, 00:09:19.890 --> 00:09:25.020 it measures dispersion as well but on a different metric, and variance is defined as the mean of 00:09:25.020 --> 00:09:29.640 squared differences from the mean. So we take each observation, we subtract the mean, and we take the 00:09:29.640 --> 00:09:36.210 square, or raise it to the second power, and then we take the mean of those squares. That gives us 00:09:36.210 --> 00:09:43.320 the variance. Variance and standard deviation are related so that the standard deviation of 00:09:43.320 --> 00:09:48.300 the data is the square root of the variance, and the variance is the square of the standard deviation. 00:09:48.300 --> 00:09:58.260 We work with either. Typically, if you just want to interpret how a variable is distributed, you 00:09:58.260 --> 00:10:03.480 look at the standard deviation, because it is in a metric that is easier to understand. If the standard 00:10:03.480 --> 00:10:11.460 deviation is 7 centimeters, we can immediately say that 60% or something 00:10:11.460 --> 00:10:18.840 of the people are between 170 and 185. So that's how standard deviations are interpreted.
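The definitions of variance, standard deviation, and standardization given above can be written out as a short Python sketch. The heights are made-up illustration values rather than the lecture's data, and the variance divides by the number of observations, following the "mean of squared differences" definition used here.

```python
import math
import statistics

# Hypothetical heights in centimeters (illustration only, not the lecture's data).
heights = [165.0, 170.0, 175.0, 180.0, 185.0]

mean = sum(heights) / len(heights)

# Variance: the mean of the squared differences from the mean.
# (This divides by n, as in the definition above; many packages divide by n - 1.)
variance = sum((h - mean) ** 2 for h in heights) / len(heights)

# Standard deviation: the square root of the variance, in the original unit (cm).
sd = math.sqrt(variance)

# Standardization: subtract the mean, then divide by the standard deviation.
z = [(h - mean) / sd for h in heights]

# Multiplying the variable by 2 doubles the standard deviation
# and quadruples the variance.
doubled = [2 * h for h in heights]

print(variance)                     # 50.0
print(sd)                           # about 7.07
print(statistics.pstdev(doubled))   # about 14.14, twice the original
print(statistics.pstdev(z))         # about 1, as expected after standardization
```

The standardized values `z` keep each person's position relative to the others, but the original location (175 cm) and spread (about 7 cm) can no longer be recovered from them.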
00:10:18.840 --> 00:10:24.570 The variance is 54.79, so that doesn't really tell us where people are located, 00:10:24.570 --> 00:10:30.780 but variance is useful for some other purposes, and particularly in more complicated models we 00:10:30.780 --> 00:10:39.030 use variances. Sometimes you report both, so that's possible as well. The concept of variance is 00:10:39.030 --> 00:10:46.650 important for understanding the concept of covariance. The idea was that the variance is the mean of the 00:10:46.650 --> 00:10:53.040 differences of each observation from the mean, raised to the second power. So it's the 00:10:53.040 --> 00:10:59.910 same as the difference from the mean multiplied by the difference from the mean. Then we have another 00:10:59.910 --> 00:11:09.550 statistic called covariance. Here we have data on height and weight. Their 00:11:09.550 --> 00:11:15.280 covariance tells us how strongly a person's height is related to the person's weight. We can see 00:11:15.280 --> 00:11:20.500 here that those people who tend to be taller also tend to be heavier, so there's a 00:11:20.500 --> 00:11:28.120 covariance here. The covariance measures how much two variables vary together, and it's defined 00:11:28.120 --> 00:11:36.310 similarly to variance, except that you don't multiply one variable's difference from its mean with itself. Instead you 00:11:36.310 --> 00:11:43.510 multiply it with the other variable's difference from its mean, and you take the mean of that. Then the concept of correlation, 00:11:43.510 --> 00:11:50.860 which many of you probably know, is just the covariance between standardized variables, and 00:11:50.860 --> 00:11:58.780 correlation varies between minus 1 and plus 1. So correlation is a standardized measure 00:11:58.780 --> 00:12:04.660 of linear association.
When the correlation is 1, then you know that two things are perfectly related; 00:12:04.660 --> 00:12:11.020 when it's minus 1, you know that two things are perfectly negatively related. When it's zero, 00:12:11.020 --> 00:12:19.480 then they are linearly unrelated. So correlation is a measure of linear association. That means 00:12:19.480 --> 00:12:27.340 that it measures how strongly observations are clustered on a line. So this is a scatter plot of 00:12:27.340 --> 00:12:36.730 two variables, and a correlation of 1 is a line. At 0.8 the observations are very closely clustered on the 00:12:36.730 --> 00:12:44.350 line, then 0.4 is something that we can still observe with the plain eye. Zero means that there is no 00:12:44.350 --> 00:12:50.620 linear relationship, and then negative correlations mean that when 00:12:50.620 --> 00:12:56.980 one variable increases, the other one decreases. So that's the same except the 00:12:56.980 --> 00:13:04.060 direction is opposite. The correlation doesn't tell us the magnitude of the change. 00:13:04.060 --> 00:13:12.130 So this is a correlation of 1: there is a huge effect of the X variable on the 00:13:12.130 --> 00:13:17.560 Y variable. This is a correlation of 1 as well: there is a small effect of the X variable on the Y 00:13:17.560 --> 00:13:23.650 variable. The Y variable here doesn't increase as strongly with the X variable as here, 00:13:23.650 --> 00:13:29.950 so correlation doesn't tell us about the magnitude of the effect. It just tells us 00:13:29.950 --> 00:13:35.830 how strong the association is. And this is zero correlation, because the Y variable doesn't vary, 00:13:35.830 --> 00:13:42.760 and then we have the negative correlations here. Importantly, correlation is a measure 00:13:42.760 --> 00:13:49.600 of linear association, so here we have two variables that are clearly associated.
So 00:13:49.600 --> 00:13:56.170 there's a clear pattern, but it's nonlinear. Here is another pattern that's nonlinear, and 00:13:56.170 --> 00:14:03.820 this is a weak positive correlation, and this is a clear association, but it's nonlinear. So 00:14:03.820 --> 00:14:11.560 correlation only tells us if we can describe the data with a line. There could be some other kind 00:14:11.560 --> 00:14:18.460 of relationship as well. So saying that two variables are uncorrelated doesn't mean that 00:14:18.460 --> 00:14:26.800 they are not related statistically; it just means that the relationship cannot be expressed as a line.
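The points made about covariance and correlation can be checked with a small Python sketch. It is an illustration under stated assumptions: the data are made up, the helper functions `covariance`, `standardize`, and `correlation` are hypothetical names written for this example, and covariance divides by the number of observations to match the "mean of products" description.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Mean of the products of the two variables' differences from their means."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m = mean(values)
    sd = math.sqrt(covariance(values, values))  # variance is covariance with itself
    return [(v - m) / sd for v in values]

def correlation(x, y):
    """Covariance between the standardized variables."""
    return covariance(standardize(x), standardize(y))

# Made-up x values, symmetric around zero (illustration only).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]

steep = [10 * v for v in x]     # large change in y per unit of x
shallow = [0.1 * v for v in x]  # small change in y per unit of x
curved = [v ** 2 for v in x]    # clear but nonlinear (U-shaped) relationship

print(correlation(x, steep))    # approximately 1
print(correlation(x, shallow))  # approximately 1
print(correlation(x, curved))   # approximately 0: no linear relationship
```

The steep and shallow lines both give a correlation of about 1, so the correlation says nothing about the size of the change in y. The U-shaped data give a correlation of about 0 even though y is completely determined by x.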