
Hello Stack Overflow people, I have spent a while searching for a solution to my problem, but not found anything, so thought I would post.

Basically I have a dataset of 196 countries listed in alphabetical order. One of the variables assigns a number from 1-10 depending on what region the country is in. For example Eastern Europe = 1, Western Europe = 2, Middle East = 3, South America = 4 and so on.

Here is a visual representation of the dataset:

Country Name    Country Region    Infant Mortality Rate
Afghanistan     3                 180
Argentina       4                 65
France          2                 12
Germany         2                 10
Poland          1                 16

What I need to do is split up the 10 regions into their own individual dummy variables, in order for me to run them through a multivariate regression to determine the individual effect they have on infant mortality rate.

I was wondering what the necessary code would be to create the dummy variables (1 = Eastern Europe, 0 = Other etc) and then how to test their effect both individually and in a multivariate regression.

Sorry if this seems simple or a stupid question, I am fairly new to using R.

Thanks for the help in advance.

Edit: This is the dput output as requested:

structure(list(Country.Name = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 11L, 7L, 9L, 10L, 12L, 13L, 14L, 8L, 15L, 17L, 20L, 21L, 
22L, 23L, 24L, 18L, 156L, 25L, 26L, 120L, 28L, 16L, 29L, 30L, 
31L, 32L, 33L, 160L, 34L, 35L, 36L, 170L, 37L, 38L, 39L, 40L, 
41L, 43L, 44L, 45L, 46L, 19L, 47L, 49L, 50L, 51L, 53L, 54L, 57L,
55L, 56L, 58L, 59L, 60L, 48L, 61L, 63L, 62L, 64L, 65L, 88L, 66L,
67L, 68L, 69L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L,
81L, 82L, 42L, 83L, 84L, 86L, 85L, 87L, 89L, 90L, 91L, 92L, 93L,
95L, 96L, 94L, 97L, 98L, 99L, 100L, 101L, 103L, 104L, 105L, 106L, 
107L, 108L, 110L, 111L, 112L, 115L, 116L, 114L, 117L, 118L, 119L, 
130L, 121L, 122L, 123L, 124L, 189L, 125L, 126L, 127L, 128L, 129L, 
113L, 109L, 132L, 131L, 133L, 134L, 135L, 136L, 137L, 138L, 139L, 
70L, 174L, 140L, 141L, 142L, 143L, 161L, 162L, 163L, 145L, 146L, 
147L, 148L, 149L, 151L, 152L, 153L, 154L, 191L, 155L, 157L, 158L, 
194L, 159L, 164L, 165L, 166L, 167L, 168L, 169L, 171L, 173L, 175L, 
176L, 177L, 184L, 178L, 179L, 180L, 181L, 182L, 183L, 102L, 52L, 
185L, 172L, 186L, 27L, 187L, 188L, 190L, 144L, 192L, 150L, 193L
), .Label = c("Afghanistan", "Albania", "Algeria", "Andorra", 
"Angola", "Antigua and Barbuda", "Argentina", "Armenia", "Australia", 
"Austria", "Azerbaijan", "Bahamas", "Bahrain", "Bangladesh", 
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan", 
"Bolivia", "Bosnia and Herzegovina", "Botswana", "Brazil", "Brunei", 
"Bulgaria", "Burkina Faso", "Burundi", "Cambodia", "Cameroon", 
"Canada", "Cape Verde", "Central African Republic", "Chad", "Chile", 
"China", "Colombia", "Comoros", "Congo", "Congo, Democratic Republic", 
"Costa Rica", "Cote d'Ivoire", "Croatia", "Cuba", "Cyprus", "Czech Republic",  
"Denmark", "Djibouti", "Dominica", "Dominican Republic", "Ecuador", 
"Egypt", "El Salvador", "Equatorial Guinea", "Eritrea", "Estonia", 
"Ethiopia", "Fiji", "Finland", "France", "Gabon", "Gambia", "Georgia", 
"Germany", "Ghana", "Greece", "Grenada", "Guatemala", "Guinea",
"Guinea-Bissau", "Guyana", "Haiti", "Honduras", "Hungary", "Iceland", 
"India", "Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", 
"Jamaica", "Japan", "Jordan", "Kazakhstan", "Kenya", "Kiribati", 
"Korea, North", "Korea, South", "Kuwait", "Kyrgyzstan", "Laos", 
"Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", 
"Lithuania", "Luxembourg", "Macedonia", "Madagascar", "Malawi", 
"Malaysia", "Maldives", "Mali", "Malta", "Marshall Islands", 
"Mauritania", "Mauritius", "Mexico", "Micronesia", "Moldova", 
"Monaco", "Mongolia", "Montenegro", "Morocco", "Mozambique",
"Myanmar", "Namibia", "Nauru", "Nepal", "Netherlands", "New Zealand", 
"Nicaragua", "Niger", "Nigeria", "Norway", "Oman", "Pakistan", 
"Palau", "Panama", "Papua New Guinea", "Paraguay", "Peru", "Philippines", 
"Poland", "Portugal", "Qatar", "Romania", "Russia", "Rwanda", 
"Samoa", "San Marino", "Sao Tome and Principe", "Saudi Arabia", 
"Senegal", "Serbia", "Serbia and Montenegro", "Seychelles", "Sierra Leone", 
"Singapore", "Slovakia", "Slovenia", "Solomon Islands", "Somalia", 
"South Africa", "Spain", "Sri Lanka", "St Kitts and Nevis", "St Lucia", 
"St Vincent and the Grenadines", "Sudan", "Suriname", "Swaziland", 
"Sweden", "Switzerland", "Syria", "Taiwan", "Tajikistan", "Tanzania", 
"Thailand", "Timor-Leste", "Togo", "Tonga", "Trinidad and Tobago", 
"Tunisia", "Turkey", "Turkmenistan", "Tuvalu", "Uganda", "Ukraine", 
"United Arab Emirates", "United Kingdom", "United States", "Uruguay", 
"Uzbekistan", "Vanuatu", "Venezuela", "Vietnam", "Yemen", "Zambia", 
"Zimbabwe"), class = "factor"), Country.Region = c(8L, 1L, 3L, 
5L, 4L, 10L, 1L, 2L, 5L, 5L, 10L, 3L, 8L, 1L, 10L, 5L, 8L, 2L, 
1L, 4L, 2L, 10L, 9L, 7L, 1L, 7L, 4L, 1L, 7L, 4L, 5L, 4L, 4L, 
8L, 4L, 2L, 6L, 6L, 2L, 4L, 4L, 4L, 2L, 1L, 2L, 3L, 1L, 4L, 5L, 
10L, 2L, 2L, 2L, 4L, 4L, 4L, 1L, 9L, 5L, 5L, 4L, 4L, 1L, 4L, 
5L, 4L, 9L, 5L, 10L, 2L, 4L, 10L, 2L, 2L, 1L, 5L, 8L, 7L, 3L, 
3L, 5L, 3L, 5L, 4L, 10L, 6L, 1L, 3L, 4L, 6L, 6L, 3L, 1L, 7L, 
3L, 4L, 1L, 4L, 3L, 5L, 1L, 5L, 4L, 4L, 7L, 8L, 4L, 5L, 4L, 4L, 
2L, 5L, 6L, 1L, 1L, 3L, 4L, 3L, 4L, 9L, 8L, 5L, 9L, 5L, 2L, 4L, 
4L, 5L, 9L, 9L, 9L, 8L, 2L, 9L, 2L, 2L, 7L, 1L, 5L, 4L, 7L, 3L, 
1L, 1L, 4L, 10L, 10L, 10L, 5L, 4L, 3L, 4L, 1L, 4L, 4L, 7L, 1L, 
7L, 1L, 4L, 4L, 4L, 5L, 4L, 10L, 4L, 5L, 5L, 3L, 1L, 7L, 4L, 
9L, 10L, 3L, 3L, 3L, 1L, 9L, 4L, 1L, 1L, 3L, 5L, 4L, 5L, 4L, 
2L, 1L, 2L, 9L, 3L, 1L, 4L), Under.5.Mortality.Rate = c(137.3500061, 
20.40999985, 30.80999947, 6.579999924, 178.6000061, 22.02000046, 
51.13999939, 20.05999947, 6.059999943, 5.46999979, 19.12000084, 
11.18999958, 79.55999756, 28.54000092, 19.89999962, 5.639999866, 
79.80999756, 56.77999878, 9.569999695, 58.18000031, 28.07999992, 
29.54999924, 34.72999954, 9.199999809, 15.46000004, 72.59999847, 
145.4600067, 14.72000027, 85.63999939, 132.8600006, 6.480000019, 
42.68000031, 150.5, 15.02999973, 185.2100067, 10.13000011, 27.06999969, 
7.619999886, 22.79000092, 78.52999878, 113.0199966, 165.1199951, 
13.39999962, 7.949999809, 7.730000019, 5.590000153, 6.460000038, 
128.3200073, 5.489999771, 20.05999947, 35.97000122, 31.18000031, 
30.44000053, 180.1799927, 126.4899979, 95.69000244, 9.210000038, 
30.03000069, 4.010000229, 4.949999809, 83.83000183, 80.19999695, 
31.62000084, 110.2300034, 4.889999866, 93.91000366, 56.91999817, 
6.400000095, 20.76000023, 45.81999969, 163.9900055, 44.61000061, 
90.98000336, 33.29999924, 9.079999924, 3.730000019, 79.45999908, 
46.09999847, 43.70999908, 39.90000153, 6.769999981, 6.690000057, 
5.730000019, 123.8099976, 23.86000061, 4.079999924, 39.5, 21.85000038, 
96.69000244, 44.25, 8.93999958, 11.47000027, 49.27000046, 91.37999725, 
12.98999977, 105.8600006, 12.56000042, 151.6100006, 19.94000053, 
NA, 9.520000458, 4.71999979, 94.59999847, 129.8999939, 8.5, 28.04000092, 
199.6399994, 6.849999905, 97.87999725, 16.95999908, 23.54999924, 
NA, 52.43000031, 19.60000038, 12.39999962, 46.31000137, 150.2299957, 
14.13000011, 61.52000046, NA, 69.44999695, 6.139999866, 32.08000183, 
7.300000191, 36.22999954, 205.1699982, 172.3699951, 4.639999866, 
34.79000092, 45.15000153, NA, 90.91000366, 22.34000015, 90.08000183, 
26.45000076, 36.41999817, 36.63000107, 8.770000458, 6.639999866, 
183.9799957, 79.04000092, 12.32999992, 19.40999985, 19.03000069, 
142.6300049, NA, 16.13999939, 23.86000061, NA, 67.91999817, 20.27000046, 
115.0800018, 7.21999979, 17.17000008, 184.1300049, 3.680000067, 
9.020000458, 17.10000038, 4.900000095, 137.2100067, 42.95999908, 
74.22000122, 5.260000229, 101.0999985, 42.54000092, 106.4199982, 
4.489999771, 5.639999866, 16.70999908, 73.68000031, 12.68999958, 
109.0500031, 21.54000092, 32.61999893, 5.940000057, 24.13999939, 
37.29999924, 61.74000168, NA, 134.8099976, 18.44000053, 17.59000015, 
40.22000122, 6.190000057, 115.1299973, 8.050000191, 161.7100067, 
15.60999966, 54.25, 21.29999924, 23.29999924, 85.33999634, NA, 
133.4700012)), .Names = c("Country.Name", "Country.Region", "Under.5.Mortality.Rate"
), class = "data.frame", row.names = c(NA, -194L))
user2292364

2 Answers


Since your goal is to fit a model for each region, I think what you want is the plyr package, which you'd need to install. It may sound daunting to a new R user, but it is definitely worth learning, since it makes it easy to run analyses on subsets of data.

In your case, your code would look like:

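library(plyr)
# dlply returns a list holding one fitted model per region
models <- dlply(df, .(Country.Region), function(d) lm(y ~ x1 + x2, data = d))

Here df is your data frame, Country.Region is the categorical variable used to subset the data, lm (or an alternative) is the modeling function, and y, x1, x2, etc. are your dependent and independent variables. dlply is used rather than ddply because ddply expects each piece to come back as a data frame, whereas lm returns a model object. As a result, you get a list with an individual model for each region.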
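The same split-and-fit idea can also be sketched in base R, without installing plyr. This is only an illustration: the data are made up, the column names are borrowed from the question, and x1 is a hypothetical predictor that does not appear in the posted dataset.

```r
# Toy data: column names mirror the question; x1 is a hypothetical predictor.
set.seed(1)
df <- data.frame(
  Country.Region = rep(1:3, each = 10),
  x1 = rnorm(30)
)
df$Under.5.Mortality.Rate <- 50 + 10 * df$Country.Region + 5 * df$x1 + rnorm(30)

# Split the data by region and fit one model per subset.
models <- lapply(
  split(df, df$Country.Region),
  function(d) lm(Under.5.Mortality.Rate ~ x1, data = d)
)

# One row of coefficients (intercept, slope) per region.
coefs <- t(sapply(models, coef))
print(coefs)
```

Each element of models is an ordinary lm object, so summary(models[["1"]]) gives the usual regression table for region 1.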

Hope this helps, and make sure you check out plyr!

  • I don't think that's what they want. – Roland Apr 19 '13 at 17:01
  • The goal was "to test their effect both individually AND in a multivariate regression," so to test them individually, you'd have to subset by region and calculate a model for each. Then, to compare them against each other, you'd compare their coefficients from the multivariate regression, using the process you described in having the regression function handle the categorical variables automatically. – I_hunt_bugs Apr 19 '13 at 17:47
  • I guess we have different definitions of "multivariate". – Roland Apr 19 '13 at 20:08
  • I presume the original poster had other variables to include as well, both for the single-region and multi-region models, otherwise this is just a contingency table. – I_hunt_bugs Apr 19 '13 at 22:03
  • @I_hunt_bugs this is not to say contingency tables cannot be modeled themselves. – Maxim.K Apr 20 '13 at 21:07

As far as I could understand the question, you are looking to set contrasts for your 10-category factor. One of the ways to set contrasts is indeed dummification. If you have 10 categories for a single nominal variable, you can create 9 dummy variables to represent the same information in a multivariate regression. If this is the case, what you are looking for is contrasts():

x <- factor(1:10,ordered=FALSE)
contrasts(x) <- contr.treatment(10)
print(x)

Then you simply use x in your regression model, R will recognize the contrast matrix and apply it accordingly. In the output you will see the levels you have set.
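As a rough sketch of what this looks like with the question's column names (the values below are made up, and only three regions are used to keep the output short): note that in the posted dput output Country.Region is stored as integers, so it has to be converted with factor() first, otherwise lm would treat it as a single continuous predictor.

```r
# Toy data with the question's column names; the values are made up.
df <- data.frame(
  Country.Region = c(1, 1, 2, 2, 3, 3),
  Under.5.Mortality.Rate = c(10, 14, 6, 8, 90, 110)
)

# A numeric code is treated as a continuous predictor unless converted.
df$Region <- factor(df$Country.Region)

# model.matrix shows the dummy columns that lm builds behind the scenes.
dummies <- model.matrix(~ Region, data = df)
print(dummies)

# Intercept = mean of region 1; other coefficients = difference from region 1.
fit <- lm(Under.5.Mortality.Rate ~ Region, data = df)
print(coef(fit))
```

With the default treatment coding, the intercept is the mean of the first region and each remaining coefficient is that region's mean minus the first region's mean.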

You can specify which category will be used as reference by using the base= option in contr.treatment. The interpretation of the regression coefficients is then a comparison of each dummy variable (mean for the respective category) with the reference category (mean for the base level).
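A minimal sketch of changing the reference category, again on made-up data with three regions; the variable names region and y are placeholders.

```r
# Made-up data: three regions, two observations each.
region <- factor(c(1, 1, 2, 2, 3, 3))
y <- c(10, 14, 6, 8, 90, 110)

# Use level 3 as the reference instead of the default first level.
contrasts(region) <- contr.treatment(3, base = 3)
print(contrasts(region))

# Intercept = mean of region 3; the other coefficients are differences from it.
fit <- lm(y ~ region)
print(coef(fit))
```

An alternative that leaves the contrast type untouched is relevel(region, ref = "3"), which simply moves the chosen level to the front so that the default coding picks it up as the base.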

There are also alternative ways of setting contrasts. For regions it may be useful to compare each region to the grand mean, rather than to the mean of one specific region. This would be achieved with deviation coding using contr.sum(10). The interpretation of your effects changes, and so does the meaning of the intercept. In general, it is very useful to know about different contrasting methods, as they make a simple tool like regression more flexible and powerful. Additional information on various coding systems in R can be found here. Good luck with your analysis.
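To illustrate how the interpretation shifts under deviation coding, here is a small sketch on made-up data (region and y are placeholder names). With a balanced design like this one, the intercept equals the mean of the group means, and each coefficient is a group's deviation from it.

```r
# Made-up data: three regions, two observations each.
region <- factor(c(1, 1, 2, 2, 3, 3))
y <- c(10, 14, 6, 8, 90, 110)

# Deviation (sum-to-zero) coding; the last level has no coefficient of its own.
contrasts(region) <- contr.sum(3)
fit <- lm(y ~ region)

group_means <- tapply(y, region, mean)
grand_mean <- mean(group_means)

print(coef(fit))
print(grand_mean)                 # matches the intercept
print(group_means - grand_mean)   # first two match the region coefficients
```

The effect for the last region is not printed by lm, but it can be recovered as minus the sum of the other coefficients.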

P.S. You can also specify a contrast matrix by hand, which is useful for deviation coding, as it has no base option.
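One way this could look, as a sketch on made-up data: the matrix below is deviation coding written out by hand so that the first level, rather than the last, is the one without its own coefficient. The names dev, region and y are placeholders.

```r
# Made-up data: three regions, two observations each.
region <- factor(c(1, 1, 2, 2, 3, 3))
y <- c(10, 14, 6, 8, 90, 110)

# Hand-written deviation coding: level 1 takes the row of -1s,
# so levels 2 and 3 are the ones compared against the grand mean.
dev <- rbind(c(-1, -1),
             c( 1,  0),
             c( 0,  1))
colnames(dev) <- c("2", "3")

contrasts(region) <- dev
fit <- lm(y ~ region)
print(coef(fit))
```

The intercept is still the mean of the group means; only the set of regions that receive an explicit coefficient changes.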

Maxim.K
  • This has worked, I'm just having trouble understanding how a comparison between the different regions can be carried out when there isn't a different output for each one. Unless I have missed something here – user2292364 Apr 21 '13 at 13:37
  • @user2292364 I am not sure what you mean. Please post your output and then specify what exactly is not clear. – Maxim.K Apr 21 '13 at 13:50
  • The output obtained after I ran the above code was the same as just including the region variable without altering it, will copy the output over in a sec – user2292364 Apr 21 '13 at 14:56
  • If you mean that the regression output (e.g. `summary(yourmodel)`) is the same, it is because `lm` (which I assume you use) will automatically apply contrasts to your factor. The default is `contr.treatment` I believe, with base = 1. If this is ok for you, no additional manipulations are required. My answer is helpful if you need to change something from default behavior (or have a general understanding of what is going on with your model). – Maxim.K Apr 21 '13 at 15:11