
I am fitting linear models in R. My predictors include birth rates, death rates, infant mortality rates, life expectancies, and region. region has 7 levels, coded as numbers:

  1. East Asia & Pacific
  2. South Asia
  3. Europe & Central Asia
  4. North America
  5. Latin America
  6. Middle East & North Africa
  7. Sub-Saharan Africa

I ran a Lasso regression in R to try to improve the generalized linear model. The Lasso regression coefficients are as follows:
(image: table of Lasso regression coefficients)

I will put the factors selected by Lasso Regression into the lm function in R:

Lasso.lm <- lm(log(GNIpercapita) ~ deathrate + infantdeaths + life.exp.avg + 
                                    life.exp.diff + region, data=econdev) 

However, how do I add an individual region into the linear model lm? For example, for regionEast Asia & Pacific, I can't just add + regionEast Asia & Pacific.
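For what it's worth, if region is stored as a factor rather than a number, lm() expands it into one dummy column per level automatically, and a single level can also be pulled out by hand as a binary indicator. A minimal sketch, using simulated data as a stand-in for the econdev data frame from the question (the column names are taken from the question, the values are made up):

```r
# Toy data standing in for the econdev data frame from the question
set.seed(42)
econdev <- data.frame(
  GNIpercapita  = exp(rnorm(70, mean = 9)),
  deathrate     = rnorm(70, 8, 2),
  infantdeaths  = rnorm(70, 20, 5),
  life.exp.avg  = rnorm(70, 70, 5),
  life.exp.diff = rnorm(70, 5, 1),
  region        = sample(1:7, 70, replace = TRUE)
)

# Recode the numeric region codes as a labelled factor; lm() would then
# expand it into one dummy per level (minus the reference level)
econdev$region <- factor(econdev$region, levels = 1:7,
  labels = c("East Asia & Pacific", "South Asia", "Europe & Central Asia",
             "North America", "Latin America", "Middle East & North Africa",
             "Sub-Saharan Africa"))

# To use only one level, build a binary dummy by hand:
econdev$eastAsiaPacific <- as.integer(econdev$region == "East Asia & Pacific")

fit <- lm(log(GNIpercapita) ~ deathrate + infantdeaths + life.exp.avg +
            life.exp.diff + eastAsiaPacific, data = econdev)
coef(fit)["eastAsiaPacific"]  # coefficient for the single-region dummy
```

The dummy-column name eastAsiaPacific is hypothetical; any syntactically valid name works, and avoiding spaces and & in the name sidesteps the need for backticks in the formula.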

user20650
  • You can create a dummy variable: 1 if the observation is in that region, 0 otherwise. And just add that to your regression with `+ myDummy`. – FatihAkici Nov 12 '20 at 03:37
  • I guess I did not frame this question very well. I want to find the adjusted R squared, which tells me the predictability. The adjusted R squared I get for the generalized model is 0.8298. I am not sure how to interpret this lasso regression result. How do I tell the predictability of a lasso regression? I am trying to improve my linear model by using different methods. –  Nov 12 '20 at 04:20

2 Answers


You cannot use pieces and parts of the category.

You can eliminate numerical variables, or entire columns of categorical variables, but you cannot pick and choose individual categories, because doing so fragments your data frame.

You might be better off using the Lasso regression model itself and predicting from it. It is no less of a regression because of the regularization. It is more complex, more robust, and less straightforward, but not 'worse'.

If that does not work for you, then you can run an lm() with the selected continuous variables and the entire region variable, and accept that the model is imperfect (as all models are); or remove region and settle for what may be a less predictive model.
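The second option above can be sketched as follows: keep the whole region factor in the model and compare adjusted R-squared with and without it. This again uses simulated stand-in data, since the real econdev data frame is not shown in the question:

```r
# Toy data standing in for econdev; region is already a factor here
set.seed(1)
econdev <- data.frame(
  GNIpercapita  = exp(rnorm(100, mean = 9)),
  deathrate     = rnorm(100, 8, 2),
  infantdeaths  = rnorm(100, 20, 5),
  life.exp.avg  = rnorm(100, 70, 5),
  life.exp.diff = rnorm(100, 5, 1),
  region        = factor(sample(1:7, 100, replace = TRUE), levels = 1:7)
)

# Full model: lm() expands the 7-level factor into 6 dummy columns
full <- lm(log(GNIpercapita) ~ deathrate + infantdeaths + life.exp.avg +
             life.exp.diff + region, data = econdev)

# Reduced model: drop region entirely
reduced <- update(full, . ~ . - region)

summary(full)$adj.r.squared     # predictability with all regions included
summary(reduced)$adj.r.squared  # predictability without region
```

Comparing the two adjusted R-squared values answers the "is region worth keeping" question directly, without ever splitting the factor apart.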

sconfluentus
  • They could add individual levels in by creating a binary dummy variable, so when region == East Asia & Pacific it is one, otherwise zero. – user20650 Nov 12 '20 at 02:49
  • Yes, you can create categories for `other` and such to aggregate regions; you can certainly combine all of the remaining variable levels into one. But by doing this you will likely not get the accuracy from the `lm()` that you get from using a regularized model. – sconfluentus Nov 12 '20 at 03:06
  • My comment was intended to illustrate how your first two paragraphs are not correct. There are modelling options, but they are probably better discussed at Cross Validated -- besides, they have not added enough details to their question to indicate that lasso regression is required. – user20650 Nov 12 '20 at 03:13
  • I see what you are saying @user20650. The question left a lot to assumption, which I am admittedly poor at. And I do tend to answer with a best-practices context as opposed to "can it be done". Always good to have a foil to reflect that other side. – sconfluentus Nov 12 '20 at 03:27
  • You can very well use pieces and parts of a categorical variable. If the logic makes sense when dividing a categorical variable into certain two groups, it is perfectly fine to create a dummy variable accordingly. – FatihAkici Nov 12 '20 at 03:40
  • @FatihAkici Yes, but you have to use all of the categories, you cannot just throw some away... that was my point. You cannot use 3 categories, group 2, and throw 2 away. – sconfluentus Nov 12 '20 at 03:50
  • @sconfluentus; you can just use three categories - just create three binary variables. Then each coefficient gives a measure of that level against all of the others (i.e. East Asia & Pacific vs not East Asia & Pacific). – user20650 Nov 12 '20 at 10:44
  • I completely understand that... but you cannot summarize only part of the categories. Your binaries have to summarize all of the levels in some way, whether it is grouping some and keeping others individual... but you cannot leave some level out altogether. – sconfluentus Nov 13 '20 at 03:05

I agree with the previous comments that it is not recommended to pick and choose parts of a categorical variable. If you would still like to do it, it is easy to create dummy variables for each level of your categorical variable using the modeldb package. Remember that in your lm() regression you have to leave one level of the categorical variable out to avoid perfect collinearity.

library(modeldb)
library(dplyr)  # for the %>% pipe

df %>% 
  add_dummy_variables(region)
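If installing modeldb is not an option, base R's model.matrix() builds the same kind of indicator columns; dropping the intercept column leaves one dummy per non-reference level. A small sketch with a made-up three-region data frame (the real data would come from the question's econdev):

```r
# Base-R alternative to modeldb::add_dummy_variables():
# model.matrix() expands a factor into 0/1 indicator columns
df <- data.frame(region = factor(c("East Asia & Pacific", "South Asia",
                                   "North America", "South Asia")))

# First column is the intercept; drop it to keep only the dummies
dummies <- model.matrix(~ region, data = df)[, -1, drop = FALSE]

df <- cbind(df, dummies)
head(df)
```

The reference level (here the first factor level alphabetically) gets no column of its own, which is exactly the "leave one level out" rule mentioned above.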
benjasast
  • Why is it not recommended? There is no such thing. If the logic makes sense when dividing a categorical variable into certain two groups, it is perfectly fine to create a dummy variable accordingly. – FatihAkici Nov 12 '20 at 03:39
  • It depends on the use OP is giving to the regression. If OP is looking to make some type of causal inference using the regression coefficients, then excluding parts of a categorical variable does not make sense. If the objective is a model for prediction, then it might make sense. – benjasast Nov 12 '20 at 03:53
  • I guess I did not frame this question very well. I want to find the adjusted R squared, which tells me the predictability. The adjusted R squared I get for the generalized model is 0.8298. I am not sure how to interpret this lasso regression result. How do I tell the predictability of a lasso regression? –  Nov 12 '20 at 04:13