GLM: Continuous variable with few states as factor or numeric?

Question

I have a basic question. I am running binomial GLMs with numeric predictors. Some of these predictors have very few unique values: some have 2, some 3, and some 4. All of these predictors are on a clear and interpretable continuous scale; I just sampled many times from very few places on the scale (I know, not ideal for regression, but it cannot be changed). Take the following table as an example. Imagine it is repeated 10,000 more times, with only the response values varying:

response	pred1	pred2	pred3
0	20	100	100
1	50	900	200
1	20	4000	800
0	50	100	900
1	20	900	100
0	50	4000	100
1	20	100	800
0	50	900	900

My question is: (when) does it make sense to convert these predictors to factors? If a numeric variable contains only 2 unique values, does it even make a difference whether it is a factor or numeric? Can I trust estimates based on just 3 or 4 unique values? Would it be better to make it a factor and thereby "acknowledge" that we cannot infer a linear regression line from the few values we have sampled?

I assume, since they can all be placed on a continuous scale, it makes sense to keep them numeric, but I just wanted to make sure I'm doing the right thing.
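
For the two-value case, a small sketch may help show why the coding does not change the fit. It assumes a model with only this one predictor and uses made-up counts (the `counts` dictionary is hypothetical, not taken from the table above). With two distinct values, both codings have two free parameters, so both reproduce the observed proportions exactly; only the scale of the coefficient differs.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (successes, trials) at the two sampled values of pred1.
counts = {20: (30, 100), 50: (60, 100)}

# Factor coding: one free probability per level, so the maximum-likelihood
# estimate for each level is its observed proportion.
p_factor = {x: s / n for x, (s, n) in counts.items()}

# Numeric coding: logit(p) = intercept + slope * x. With only two distinct
# x values, a straight line on the logit scale passes through both
# observed proportions exactly, so this is also the maximum-likelihood fit.
x_lo, x_hi = sorted(counts)
slope = (logit(p_factor[x_hi]) - logit(p_factor[x_lo])) / (x_hi - x_lo)
intercept = logit(p_factor[x_lo]) - slope * x_lo
p_numeric = {x: inv_logit(intercept + slope * x) for x in counts}

for x in sorted(counts):
    print(x, round(p_factor[x], 6), round(p_numeric[x], 6))

# The factor coefficient (level 50 vs. level 20) is the numeric slope
# multiplied by the distance between the two values.
factor_contrast = slope * (x_hi - x_lo)
print("slope per unit of pred1:", slope)
print("log-odds difference between the two levels:", factor_contrast)
```

Because the coefficient and its standard error are rescaled by the same constant, the Wald test for the predictor is the same under either coding. The choice only matters for interpretation (per unit vs. per level) and for prediction at values that were never sampled.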

It depends. If you suspect a linear relationship between the predictor and the log odds of your outcome then you should probably just keep them numeric. However, imagine sampling people where the only ages in your sample were 5 year olds, 40 year olds and 80 year olds. If your outcome was the probability of having a part-time job you might be best converting age to a factor. If your sample contained 20, 30, and 40 year olds and you are modelling probability of being a millionaire, then perhaps keeping it numeric would be better. — Allan Cameron, Dec 14 '20 at 19:02
Yes, I do expect a linear relationship. Thanks, your comment helped!! — Alex_H, Dec 16 '20 at 14:45
Maybe as a side question: would it help anything making them ordinal factors as kind of an intermediate solution? — Alex_H, Dec 16 '20 at 14:47
You will probably get a better model fit doing it that way, unless the conditional means at each group are exactly linear (which is very unlikely in real life data) — Allan Cameron, Dec 16 '20 at 15:51
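
The point about model fit can be checked with a likelihood-ratio test, since the linear (numeric) model is nested inside the factor model. The following is a minimal sketch in plain Python, assuming a single predictor with three sampled values and hypothetical grouped counts. The helper names (`fit_linear_logit`, `log_lik`) are illustrative, and in practice a standard GLM routine would replace the hand-written Newton-Raphson step.

```python
import math

def fit_linear_logit(xs, succ, trials, iters=50, tol=1e-12):
    """Newton-Raphson for logit(p) = b0 + b1 * x on grouped binomial data."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in zip(xs, succ, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = s - n * p              # score contribution
            w = n * p * (1.0 - p)      # information weight
            g0 += r
            g1 += x * r
            h00 += w
            h01 += x * w
            h11 += x * x * w
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 system H d = g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def log_lik(succ, trials, probs):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return sum(s * math.log(p) + (n - s) * math.log(1.0 - p)
               for s, n, p in zip(succ, trials, probs))

# Hypothetical counts at the three sampled values of pred2.
xs = [100, 900, 4000]
succ = [20, 45, 55]
trials = [100, 100, 100]

# Numeric coding: two parameters (intercept and slope).
b0, b1 = fit_linear_logit(xs, succ, trials)
p_linear = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]

# Factor coding: three parameters, fitted values are the observed proportions.
p_factor = [s / n for s, n in zip(succ, trials)]

# Likelihood-ratio statistic with 3 - 2 = 1 degree of freedom.
lr_stat = max(0.0, 2.0 * (log_lik(succ, trials, p_factor)
                          - log_lik(succ, trials, p_linear)))
# Upper tail of a chi-square with 1 df; this formula is valid for 1 df only.
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print("linear fit:", [round(p, 3) for p in p_linear])
print("factor fit:", p_factor)
print("LR statistic:", lr_stat, "p-value:", p_value)
```

The factor model can never fit worse than the linear one, so the useful question is whether the improvement is larger than chance would explain. An ordered factor with its full set of contrasts gives the same fitted values as an unordered factor; it only reparametrizes the level effects, for example into linear and quadratic components.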
