I have a basic question. I am running binomial GLMs with numeric predictors. Some of these predictors have very few unique values: some have 2, some 3, and some 4. All of them are on a clear and interpretable continuous scale; I just sampled many times from very few points on the scale (I know, not ideal for regression, but it cannot be changed). Take the following table as an example. Imagine it repeated 10,000 more times, with only the response values varying:

response  pred1  pred2  pred3
0         20     100    100
1         50     900    200
1         20     4000   800
0         50     100    900
1         20     900    100
0         50     4000   100
1         20     100    800
0         50     900    900

My question is: (when) does it make sense to translate these predictors into factors? If a numeric variable only contains 2 unique values, does it even make a difference if it's a factor or numeric? Can I trust estimates based on just 3 or 4 unique values? Would it be better to make it a factor and thereby "acknowledge" that we cannot infer a linear regression line from the few values we have sampled?
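For the two-value case, a small sketch may make the answer concrete. It assumes a single predictor with two values and an intercept, and it uses made-up counts (the `counts` dictionary is hypothetical, not data from this question). With two groups and two parameters, the maximum-likelihood fit reproduces the observed proportions under either coding, so the coefficients can be written down in closed form and compared. The same equivalence holds when other predictors are in the model, as long as the variable enters as a main effect, because the numeric column is just a linear transformation of the dummy column.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical counts (NOT from the question): (successes, trials) at each value of pred1.
counts = {20: (30, 100), 50: (60, 100)}

p20 = counts[20][0] / counts[20][1]
p50 = counts[50][0] / counts[50][1]

# Factor coding: intercept is the log odds at the reference level (20),
# and one dummy coefficient gives the change in log odds at level 50.
factor_intercept = logit(p20)
factor_effect = logit(p50) - logit(p20)

# Numeric coding: a straight line through the two points on the log-odds scale.
numeric_slope = (logit(p50) - logit(p20)) / (50 - 20)
numeric_intercept = logit(p20) - numeric_slope * 20

fitted_factor = {
    20: inv_logit(factor_intercept),
    50: inv_logit(factor_intercept + factor_effect),
}
fitted_numeric = {x: inv_logit(numeric_intercept + numeric_slope * x) for x in (20, 50)}

print("fitted (factor): ", fitted_factor)
print("fitted (numeric):", fitted_numeric)
print("factor effect:", factor_effect, " slope * 30:", numeric_slope * 30)
```

Both codings give the same fitted probabilities; only the unit of the coefficient differs (change per level versus change per unit of the scale). With 3 or 4 values the two codings are no longer equivalent, because the numeric version constrains the log odds to a straight line.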

I assume, since they can all be placed on a continuous scale, it makes sense to keep them numeric, but I just wanted to make sure I'm doing the right thing.

Alex_H
  • It depends. If you suspect a linear relationship between the predictor and the log odds of your outcome then you should probably just keep them numeric. However, imagine sampling people where the only ages in your sample were 5 year olds, 40 year olds and 80 year olds. If your outcome was the probability of having a part-time job you might be best converting age to a factor. If your sample contained 20, 30, and 40 year olds and you are modelling probability of being a millionaire, then perhaps keeping it numeric would be better. – Allan Cameron Dec 14 '20 at 19:02
  • Yes, I do expect a linear relationship. Thanks, your comment helped!! – Alex_H Dec 16 '20 at 14:45
  • Maybe as a side question: would it help anything making them ordinal factors as kind of an intermediate solution? – Alex_H Dec 16 '20 at 14:47
  • 1
    You will probably get a better model fit doing it that way, unless the conditional means at each group are exactly linear (which is very unlikely in real life data) – Allan Cameron Dec 16 '20 at 15:51
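The linearity assumption discussed in the comments can be checked when a predictor has three or more values, by comparing the numeric fit against the factor fit with a likelihood-ratio test. The following is a minimal sketch under stated assumptions: one predictor with three values, hypothetical counts (the `successes` and `trials` lists are invented for illustration), `pred2` rescaled to thousands for numerical stability, and a hand-written Newton-Raphson fit (`fit_logistic` is a helper defined here, not a library function). With a single predictor, the factor model is saturated, so its fitted probabilities are simply the observed proportions.

```python
import math

def fit_logistic(xs, successes, trials, iterations=50):
    """Newton-Raphson fit of logit(p) = b0 + b1 * x on grouped binomial data."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in zip(xs, successes, trials):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1 - p)          # binomial variance weight
            g0 += y - n * p              # score for the intercept
            g1 += (y - n * p) * x        # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def log_lik(ps, successes, trials):
    # Binomial coefficients are omitted; they cancel in the likelihood ratio.
    return sum(y * math.log(p) + (n - y) * math.log(1 - p)
               for p, y, n in zip(ps, successes, trials))

# Hypothetical data (NOT from the question): pred2 values 100, 900, 4000 in thousands.
xs = [0.1, 0.9, 4.0]
successes = [200, 300, 450]
trials = [1000, 1000, 1000]

# Numeric coding: 2 parameters (intercept and slope).
b0, b1 = fit_logistic(xs, successes, trials)
p_numeric = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]

# Factor coding: 3 parameters, fitted probabilities equal the observed proportions.
p_factor = [y / n for y, n in zip(successes, trials)]

ll_numeric = log_lik(p_numeric, successes, trials)
ll_factor = log_lik(p_factor, successes, trials)

# Likelihood-ratio statistic with 3 - 2 = 1 degree of freedom.
lr_stat = 2 * (ll_factor - ll_numeric)
# Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(max(lr_stat, 0.0) / 2))

print("slope per 1000 units:", b1)
print("LR statistic:", lr_stat, " p-value:", p_value)
```

A small p-value suggests the straight-line constraint on the log odds fits noticeably worse than separate level effects. A large one means the data do not contradict linearity, which is weaker than confirming it when only three points on the scale were sampled. With a 4-valued predictor the comparison has 2 degrees of freedom, so the `erfc` shortcut for the p-value no longer applies.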
