I have a data set with a `company` column, and I want to fit a regression model on it.

Should I convert it using `model.matrix`, or just assign the values 1 to 28 in a single column?

What is the point of converting it to 28 columns when the `lm` function can deal with it?

marc_s
Ankit Katiyar
  • `lm` will perform that exact conversion under the hood. A potential advantage of converting prior to regressing would be if you were running many regressions on the same data. Performing the conversion once could speed up the process. Usually, it is better to rely on `lm`. – lmo Jul 22 '17 at 12:03

1 Answer

Should I convert it using model.matrix or just assign values from 1-28 in one column?

You should do neither:

  • If you assign values from 1 to 28 in one column, `lm` will treat that column as a single numeric predictor and fit one slope for it. That is like saying that company 28 has 28 times the weight of company 1, whereas each company should be treated on an equal footing in your analysis (assuming these are company names that do not have an ordinal relationship).
  • Using `model.matrix` will convert your company column into dummy variables (0/1 flags), but you do not need to do that yourself, since `lm` does it automatically when the column is a factor (or character).
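
The difference between the two encodings can be seen in a small sketch. The data below are simulated purely for illustration, and the names `df`, `x`, `y` and `company_id` are assumptions, not part of the question's data set:

```r
set.seed(1)

# Hypothetical data: 28 companies with 10 observations each
df <- data.frame(
  company = factor(rep(paste0("company_", 1:28), each = 10)),
  x = rnorm(280)
)
df$y <- 2 * df$x + rnorm(280)

# Factor column: lm expands it into dummy variables internally
fit_factor <- lm(y ~ x + company, data = df)

# Integer codes 1..28: lm fits ONE slope, forcing a linear trend across companies
df$company_id <- as.integer(df$company)
fit_numeric <- lm(y ~ x + company_id, data = df)

length(coef(fit_factor))   # 29: intercept + x + 27 company dummies
length(coef(fit_numeric))  # 3: intercept + x + a single company_id slope
```

With the factor, every company gets its own coefficient relative to the reference level; with the integer codes, all 28 companies share one slope.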

What is the relevance of converting it to 28 columns when lm function can deal with it?

As mentioned above, `lm` does that for you, so there is no need to do it on your own. However, I need to point out that you will end up with 27 dummy columns (plus the intercept), because one level (the reference level) is left out on purpose. The reason is that once you know the flags for the other 27 companies you implicitly know the 28th as well: together with the intercept, the reference column is an exact linear combination of the other 27, so it needs to be omitted.
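
A minimal sketch of this, assuming a made-up `company` factor with 28 levels and two rows per level (the variable names are illustrative only):

```r
# Hypothetical factor: 28 companies, 2 rows each
company <- factor(rep(paste0("company_", 1:28), each = 2))

# Default treatment contrasts: intercept + 27 dummies, first level is the reference
X <- model.matrix(~ company)
dim(X)  # 56 rows, 28 columns

# Dropping the intercept keeps all 28 dummies instead
X_full <- model.matrix(~ company - 1)

# Each row has exactly one flag set, so the 28 dummies sum to the intercept column
all(rowSums(X_full) == 1)

# Intercept plus all 28 dummies: 29 columns, but only 28 are linearly independent
X_over <- cbind(1, X_full)
ncol(X_over)
qr(X_over)$rank
```

The rank falling one short of the column count is the redundancy that dropping the reference level avoids.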

LyzandeR
  • Makes sense. One thing I found useful is the point @lmo raised in a comment: performance, if I convert it beforehand. – Ankit Katiyar Jul 22 '17 at 12:11
  • I haven't tested the actual performance with or without dummy variables, but seeing the source code for `lm` (there is an `if-else` statement that uses `model.matrix` and it is literally one line) I don't think it would provide a big boost. Using `model.matrix` on your own would create additional overhead for you if you want to do predictions, since you would need to use `model.matrix` on the new data, too. – LyzandeR Jul 22 '17 at 12:20
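
The prediction overhead mentioned in the last comment can be sketched as follows. This is only an illustration on simulated data with three hypothetical companies; `train`, `new_data` and the other names are assumptions:

```r
set.seed(42)

# Hypothetical training data: 3 companies for brevity
train <- data.frame(
  company = factor(rep(c("A", "B", "C"), each = 5)),
  x = rnorm(15)
)
train$y <- train$x + rnorm(15)

# New observation; the factor must carry the same levels as the training data
new_data <- data.frame(
  company = factor("B", levels = levels(train$company)),
  x = 0.5
)

# Formula interface: predict() rebuilds the dummy variables for the new data
fit <- lm(y ~ x + company, data = train)
p_formula <- predict(fit, newdata = new_data)

# Manual route: the same encoding has to be repeated by hand on the new data
X_train <- model.matrix(~ x + company, data = train)
fit_manual <- lm.fit(X_train, train$y)
X_new <- model.matrix(~ x + company, data = new_data)
p_manual <- drop(X_new %*% fit_manual$coefficients)

# Both routes should give the same prediction
all.equal(unname(p_formula), unname(p_manual))
```

The manual route works, but every prediction requires re-encoding the new data with matching factor levels, which `predict` otherwise handles for you.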