Do I need to transform nominal variables to be distinct fields for sklearn random forest?

Question

This is a sample of the dataset I'm using to look at lapsed customers. I've converted the categorical values to numbers. However, I believe that sklearn's random forest will treat these fields as ordered numeric values, e.g. assume that customer number 4 is double customer number 2. Do I need to cross-tab or vectorize these values before applying my random forest model?

Lapse_Flag,Cust,Sales,Cust Age,State,Main Sales Territory
0,1,28.46,3,1,1
0,2,46.07,3,2,1
0,3,108.48,3,3,2
1,4,265,3,4,3
0,5,54.42,3,5,4
0,6,0,1,6,3
0,7,371.93,3,7,5
1,8,35.6,3,8,6
1,9,357.95,2,9,7
0,10,5584.14,3,5,4
0,11,41207.02,3,10,4
0,12,5958.18,3,5,4
0,13,1028.14,1,11,7
0,14,446.67,2,7,5
0,15,0,3,1,1
0,16,6256,2,12,7
0,17,4618.72,3,2,1
1,18,275.58,3,12,2
1,19,1417.22,2,8,6

score 0 · Answer 1 · answered Oct 06 '14 at 20:38

I assume each customer can belong to exactly one class, so this is not a multi-label problem.

In that case, using a classifier such as RandomForestClassifier() rather than a regressor is fine, and sklearn handles the multi-class case for you.
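As a rough illustration, here is a minimal sketch that fits RandomForestClassifier() on the first nine rows of the sample from the question, with Lapse_Flag as the class label. Dropping the Cust column (treated here as a plain identifier), the choice of n_estimators=10 and the fixed random_state are assumptions made for the example, not something stated in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# First nine rows of the sample in the question. Columns:
# Lapse_Flag, Cust, Sales, Cust Age, State, Main Sales Territory
rows = np.array([
    [0, 1, 28.46, 3, 1, 1],
    [0, 2, 46.07, 3, 2, 1],
    [0, 3, 108.48, 3, 3, 2],
    [1, 4, 265.0, 3, 4, 3],
    [0, 5, 54.42, 3, 5, 4],
    [0, 6, 0.0, 1, 6, 3],
    [0, 7, 371.93, 3, 7, 5],
    [1, 8, 35.6, 3, 8, 6],
    [1, 9, 357.95, 2, 9, 7],
])

y = rows[:, 0].astype(int)  # Lapse_Flag is the class label
X = rows[:, 2:]             # assumption: drop Cust, which looks like an ID

# A classifier, not a regressor: the target is a class, not a quantity.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

predictions = clf.predict(X)  # one predicted class per row
print(clf.classes_)
print(predictions)
```

Predicting on the training rows only shows that the call sequence works; it says nothing about how well the model generalises.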

See the scikit-learn multi-class/multi-label documentation, which includes an example of a similar case.

You can also use LabelEncoder to transform your data.
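A minimal sketch of what LabelEncoder does, assuming the raw column holds string labels. The state names below are hypothetical, since the data in the question is already integer-coded.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels, for illustration only.
states = ["NSW", "VIC", "QLD", "NSW", "WA"]

encoder = LabelEncoder()
state_codes = encoder.fit_transform(states)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(encoder.classes_))   # ['NSW', 'QLD', 'VIC', 'WA']
print(list(state_codes))        # [0, 2, 1, 0, 3]

# The mapping is reversible.
restored = list(encoder.inverse_transform(state_codes))
print(restored)
```

The result is still a single column of integers, so on its own this does not remove the ordering concern raised in the question. The scikit-learn documentation also describes LabelEncoder as intended for target values rather than input features.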
