13

I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:

1: 0 days in arrears, 2: 30-60 days in arrears, 3: 60-90 days in arrears, and 4: 90+ days in arrears (4 levels).

As independent variables I have several numeric variables: loan to value, debt to income, and interest rate.

Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummy variables, but those examples were all for independent variables.

This did not work:

fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
smci
Tim_Utrecht

3 Answers

12

Linear regression does not take a categorical variable as the dependent variable; the response has to be continuous. Since your AccountStatus variable has only four levels, it is not feasible to treat it as continuous. Before starting any statistical analysis, one should be aware of the measurement levels of one's variables.

What you can do is use multinomial logistic regression, see here for instance. Alternatively, you can recode the AccountStatus as dichotomous and use simple logistic regression.
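As a rough sketch of the two options just mentioned: the data frame below is simulated (the asker's real mydata is not available), the cut points and the dummy column name InArrears are assumptions made for illustration, and nnet::multinom() is only one of several ways to fit a multinomial model in R.

```r
# Simulated stand-in for mydata; all values are made up for illustration.
set.seed(1)
n <- 500
mydata <- data.frame(OriginalLoanToValue = runif(n, 40, 110))
# Assumption: a higher loan-to-value pushes accounts toward deeper arrears.
latent <- 0.05 * mydata$OriginalLoanToValue + rlogis(n)
mydata$AccountStatus <- factor(cut(latent, c(-Inf, 3, 4.5, 6, Inf), labels = FALSE))

# Option 1: multinomial logistic regression (treats the four levels as unordered).
library(nnet)
fit_multi <- multinom(AccountStatus ~ OriginalLoanToValue, data = mydata, trace = FALSE)
summary(fit_multi)

# Option 2: recode as dichotomous (any arrears vs. none), then logistic regression.
mydata$InArrears <- as.integer(mydata$AccountStatus != "1")  # hypothetical column
fit_logit <- glm(InArrears ~ OriginalLoanToValue, data = mydata, family = binomial)
summary(fit_logit)
```

Note that the dichotomous recoding throws away the distinction between the three arrears levels, so it only answers the question "in arrears or not".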

Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly recommend this book.

Maxim.K
  • Thanks Maxim, this is not disappointing for me. I'm glad that there is another way of solving it. Thank you. – Tim_Utrecht Mar 05 '14 at 09:44
  • or ordinal regression (`MASS::polr()`, `ordinal` package among others) – Ben Bolker Jun 08 '16 at 11:17
  • Hello; I believe that the lm function deals with categorical variables now, by making a coefficient and a binary variable for each category. However, I am concerned about your sentence: "Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really". Does this mean that lm()'s current handling of categorical variables is just ad hoc and doesn't work very well for predictions in general? – Ovi Apr 08 '18 at 20:52
  • @Ovi: it means that **linear regression** is not designed to handle categorical responses. As @MaximK says, it doesn't have anything to do with `lm()` or R: any linear regression procedure will fail (*or* naively convert the categorical variable to integer values, which is either questionable (if the variable is ordered) or completely wrong (if the variable is unordered)) – Ben Bolker Jan 17 '21 at 00:21
4

Expanding a little on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), so you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has a different and more complex underlying theory than simple linear regression, you should probably read more about it (e.g. at the Wikipedia article linked above, in the vignettes of the ordinal package, at the UCLA stats consulting page on ordinal regression, or by browsing related questions on CrossValidated).
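A minimal sketch of what a polr() fit might look like, using simulated data in place of the asker's mydata. The level labels, the cut points, and the loan-to-value of 80 used for prediction are assumptions chosen only to make the example run.

```r
library(MASS)

# Simulated stand-in for mydata; all values are made up for illustration.
set.seed(2)
n <- 500
mydata <- data.frame(OriginalLoanToValue = runif(n, 40, 110))
latent <- 0.05 * mydata$OriginalLoanToValue + rlogis(n)
# polr() needs the response as a factor; an ordered factor makes the intent explicit.
mydata$AccountStatus <- cut(latent, c(-Inf, 3, 4.5, 6, Inf),
                            labels = c("0", "30-60", "60-90", "90+"),
                            ordered_result = TRUE)

# Proportional-odds logistic regression; Hess = TRUE so summary() can report
# standard errors without refitting.
fit_ord <- polr(AccountStatus ~ OriginalLoanToValue, data = mydata, Hess = TRUE)
summary(fit_ord)

# Predicted probability of each status at a loan-to-value of 80.
probs <- predict(fit_ord, newdata = data.frame(OriginalLoanToValue = 80),
                 type = "probs")
probs
```

The model estimates a single slope for OriginalLoanToValue plus three cut points (fit_ord$zeta) separating the four levels, which is where the gain over a multinomial fit comes from.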

Ben Bolker
1

If you can assign a numeric value to each level of the variable, then you might have a solution. You have to recode the levels as numbers, then convert the variable to numeric. Here is how:

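library(plyr)
# Recode each factor level to the number (still stored as a label) it should stand for.
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
               c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))

# Go through as.character() so the labels, not the internal level codes, become the numbers.
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))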

This recodes the levels and then converts the variable to numeric. The results I get are consistent with the original values in the dataset, where the variable was stored as a factor. You can use this approach to rename the levels to whatever you like while converting the variable to numeric.
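To show why the conversion goes through as.character(), here is a small base-R sketch. The vectors status and answers and the lookup table scores are hypothetical, and the named-vector lookup is an alternative to plyr::revalue(), not the method used above.

```r
# A factor whose labels look like numbers (levels given explicitly for clarity).
status <- factor(c("3", "-1", "2", "3", "1"), levels = c("-1", "1", "2", "3"))

as.numeric(status)                # 4 1 3 4 2: internal level codes, not the labels
as.numeric(as.character(status))  # 3 -1 2 3 1: the numbers written in the labels

# Base-R alternative to revalue(): look the scores up in a named vector.
scores <- c("(1) Very Suitable" = 3, "(2) Suitable" = 2,
            "(3) Somewhat Suitable" = 1, "(4) Not Suitable At All" = -1)
answers <- factor(c("(2) Suitable", "(4) Not Suitable At All", "(1) Very Suitable"))
numeric_scores <- unname(scores[as.character(answers)])
numeric_scores                    # 2 -1 3
```

Calling as.numeric() directly on the factor would silently return the level codes, which is an easy mistake to miss because the result is still numeric.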

Finally, this is worth doing because it allows you to draw histograms or fit regressions, something you cannot do directly while the variable is stored as a factor.

Hope this helps!

saladin1991
  • This is reasonable but makes a very strong assumption (that the levels of the response are evenly spaced) which may or may not be justified – Ben Bolker Jan 17 '21 at 00:22