WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:03.691 Categorical variables are variables that don't really have any order. 00:00:03.691 --> 00:00:05.785 So, for example, a country 00:00:05.785 --> 00:00:07.302 would be a categorical variable 00:00:07.302 --> 00:00:09.808 with values of Finland, Sweden and Norway. 00:00:10.210 --> 00:00:14.677 Our Prestige dataset has a categorical variable of occupation type, 00:00:14.811 --> 00:00:16.281 and the categories are 00:00:16.281 --> 00:00:18.600 blue collar, white collar and professional workers. 00:00:19.015 --> 00:00:21.235 So, how do we deal with these kinds of variables? 00:00:21.235 --> 00:00:24.930 When you have a categorical variable as an independent variable, 00:00:24.930 --> 00:00:27.189 things are relatively straightforward. 00:00:27.189 --> 00:00:30.865 When you have a categorical variable as a dependent variable, 00:00:30.865 --> 00:00:33.290 then things will get a bit more complicated. 00:00:33.290 --> 00:00:38.370 We will now cover the case of a categorical variable as an independent variable. 00:00:39.066 --> 00:00:42.240 So our data looks like this. 00:00:42.240 --> 00:00:43.890 In our Prestige dataset, 00:00:43.890 --> 00:00:47.548 when we take a summary of the variable type, 00:00:48.000 --> 00:00:49.947 we can see that there are frequencies. 00:00:49.947 --> 00:00:53.072 So there are no means or standard deviations or such, 00:00:53.072 --> 00:00:55.107 just frequencies of different values. 00:00:55.576 --> 00:00:57.960 We have 44 blue collar occupations, 00:00:57.960 --> 00:01:00.060 31 professional occupations, 00:01:00.060 --> 00:01:03.420 and 23 white collar occupations, 00:01:03.420 --> 00:01:06.060 and then we have four missing values. 00:01:06.944 --> 00:01:09.330 So how do we deal with that in a regression analysis? 00:01:09.330 --> 00:01:14.279 We can't put that in directly as an independent variable, because 00:01:14.935 --> 00:01:16.950 a unit difference doesn't make sense.
00:01:16.950 --> 00:01:18.912 We can't say that 00:01:18.912 --> 00:01:22.590 the difference between blue collar and professional is one unit, 00:01:22.590 --> 00:01:25.260 the difference between professional and white collar is one unit, 00:01:25.260 --> 00:01:29.730 and the difference between blue collar and white collar is two units. 00:01:29.730 --> 00:01:31.380 It doesn't make sense, 00:01:31.380 --> 00:01:32.889 because we don't know 00:01:33.130 --> 00:01:34.860 what the difference between them is. 00:01:34.860 --> 00:01:38.760 We can't say that there is a magnitude of difference between these values, 00:01:38.760 --> 00:01:40.290 and we can't say that there's an order. 00:01:40.544 --> 00:01:41.984 So how do we deal with that? 00:01:41.984 --> 00:01:44.880 We use something called dummy variables. 00:01:44.880 --> 00:01:47.370 So we code our data like this; 00:01:47.370 --> 00:01:49.461 this is a subset of our data. 00:01:49.648 --> 00:01:52.198 And we have the variable type here. 00:01:52.198 --> 00:01:55.650 Then each observation in the dataset 00:01:56.970 --> 00:01:59.910 gets a code on each of the dummy variables. 00:01:59.910 --> 00:02:05.578 The dummy variables are type blue collar, type professional and type white collar. 00:02:05.578 --> 00:02:11.281 And then the dummy indicates that this first occupation is a professional occupation. 00:02:11.281 --> 00:02:15.224 So type professional gets one, 00:02:15.358 --> 00:02:16.978 the others get zeros. 00:02:16.978 --> 00:02:21.280 Then type blue collar is one 00:02:21.280 --> 00:02:23.200 for this blue collar occupation, 00:02:23.200 --> 00:02:23.941 and the others are zero. 00:02:24.115 --> 00:02:26.280 So these are dummy or indicator variables, 00:02:26.280 --> 00:02:32.370 and they indicate which category each occupation belongs to. 00:02:33.321 --> 00:02:36.681 Then we add the dummies to the regression analysis.
00:02:36.815 --> 00:02:39.480 Stata or R can do that automatically for you, 00:02:39.480 --> 00:02:42.150 so you don't have to do the coding manually. 00:02:42.150 --> 00:02:43.440 If you want to use SPSS, 00:02:43.440 --> 00:02:45.090 you have to manually create the dummies. 00:02:45.291 --> 00:02:49.020 So SPSS doesn't simplify your life that much in that regard. 00:02:49.234 --> 00:02:51.591 Then let's take a look at the regression results. 00:02:51.591 --> 00:02:53.646 When you add a categorical variable 00:02:53.646 --> 00:02:56.487 to a regression analysis in R or Stata, 00:02:57.759 --> 00:03:04.350 we have the categorical variable here. 00:03:04.859 --> 00:03:08.459 R will automatically notice that this is a categorical variable 00:03:08.459 --> 00:03:11.580 and it will produce two dummy variables, 00:03:11.580 --> 00:03:15.674 type professional and type white collar. 00:03:16.263 --> 00:03:18.390 So how do we interpret those results, 00:03:18.390 --> 00:03:19.860 and where's the blue collar category? 00:03:20.529 --> 00:03:23.555 Well, the first thing that we need to understand is that 00:03:23.555 --> 00:03:28.827 every time you have a categorical variable and use dummy variables to 00:03:28.934 --> 00:03:30.870 include it in a regression analysis, 00:03:30.870 --> 00:03:33.210 one of those categories is left out. 00:03:34.134 --> 00:03:37.164 So we are leaving out 00:03:37.164 --> 00:03:41.160 the blue collar category here. 00:03:41.160 --> 00:03:43.305 And now these effects are, 00:03:43.386 --> 00:03:48.685 in terms of average prestige, 00:03:48.685 --> 00:03:50.610 how much is the average difference between a 00:03:50.610 --> 00:03:54.630 professional occupation and a blue collar occupation, 00:03:54.978 --> 00:03:56.418 and how much is the difference between a 00:03:56.418 --> 00:03:59.400 white collar occupation and a blue collar occupation?
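The dummy coding described above, with one category left out as the reference, could be sketched roughly as follows. This is only an illustration in plain Python, not what R or Stata run internally. The level labels "bc", "wc" and "prof", the example rows, and the helper name dummy_code are assumptions made up for the sketch.

```python
# Toy illustration of dummy (indicator) coding.
# The rows below are made-up examples, not the actual Prestige data,
# and the labels "bc", "wc", "prof" are assumed abbreviations.
occupation_types = ["prof", "bc", "wc", "bc", "prof"]


def dummy_code(values, reference):
    """Return one 0/1 column per category, except the reference category."""
    levels = sorted(set(values))
    kept = [level for level in levels if level != reference]
    return {
        f"type_{level}": [1 if value == level else 0 for value in values]
        for level in kept
    }


# Blue collar is left out, so it becomes the reference category.
dummies = dummy_code(occupation_types, reference="bc")
for name, column in dummies.items():
    print(name, column)
```

A blue collar row has zeros on both remaining dummies, which is why no separate blue collar column is needed.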
00:03:59.400 --> 00:04:05.280 So these regression coefficients refer to differences between the occupation types. 00:04:05.280 --> 00:04:12.660 And one category of the categorical variable 00:04:12.660 --> 00:04:14.121 is always used as the reference category. 00:04:14.576 --> 00:04:19.921 So these can be interpreted only against the blue collar category. 00:04:19.921 --> 00:04:25.818 If we want to compare other pairs, such as professional and white collar, 00:04:25.818 --> 00:04:30.000 then we can either 00:04:30.000 --> 00:04:31.784 manually include the dummies, 00:04:31.784 --> 00:04:33.284 or indicate manually 00:04:33.284 --> 00:04:35.430 which of these categories is left out. 00:04:35.430 --> 00:04:37.050 That's more advanced. 00:04:38.376 --> 00:04:42.837 One more thing that we note here for the first time is that 00:04:42.837 --> 00:04:46.230 R tells us that there are observations missing 00:04:46.471 --> 00:04:47.761 from the data. 00:04:47.761 --> 00:04:50.670 So four observations were missing, 00:04:50.670 --> 00:04:53.280 because they didn't have a value for the type variable. 00:04:53.682 --> 00:04:58.050 Quite often, when you have some missing data, 00:04:58.050 --> 00:05:01.170 the default action is just to omit those cases 00:05:01.170 --> 00:05:03.431 for which a variable doesn't have a value. 00:05:03.645 --> 00:05:05.644 There are other more advanced techniques, 00:05:05.644 --> 00:05:09.840 but if the number of observations that you drop is small 00:05:09.840 --> 00:05:12.000 compared to the overall number of observations, 00:05:12.000 --> 00:05:14.700 then dropping the cases doesn't really matter.
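To make the interpretation concrete, here is a small sketch of what the coefficients amount to when the dummies are the only predictors in the model. That is an assumption of this sketch; with other predictors included, the coefficients are adjusted differences instead. In the dummies-only case, the least squares intercept is the mean of the reference category and each dummy coefficient is the difference between that category's mean and the reference mean. The rows and prestige values are hypothetical, and the row with no type is dropped before anything is computed, mirroring the default omission of incomplete cases.

```python
from statistics import mean

# Hypothetical (type, prestige) rows; None marks a missing type.
# These numbers are invented for illustration only.
rows = [
    ("bc", 30.0), ("bc", 40.0),
    ("prof", 70.0), ("prof", 60.0),
    ("wc", 45.0), ("wc", 39.0),
    (None, 50.0),
]

# Default handling of missing data: omit cases without a type value.
complete = [(t, y) for t, y in rows if t is not None]
n_dropped = len(rows) - len(complete)


def group_mean(data, level):
    """Average prestige of the rows belonging to one category."""
    return mean(y for t, y in data if t == level)


reference = "bc"  # the category that is left out
intercept = group_mean(complete, reference)
coefficients = {
    level: group_mean(complete, level) - intercept
    for level in ("prof", "wc")
}

print("dropped:", n_dropped)
print("intercept (reference mean):", intercept)
print("differences from reference:", coefficients)
```

Changing the value of reference changes which comparisons the coefficients express, which is the manual choice of the left-out category mentioned above.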