
I am trying to run a machine learning problem using scikit-learn on a dataset where one of the columns (features) has high cardinality: around 300K unique values. How do I vectorize such a feature? Using DictVectorizer is not a solution, as the machine runs out of memory.

I have read in a few posts that I could just assign numbers to those string values, but that would lead to misleading results.

Has anyone dealt with this kind of feature set? If so, how do I vectorize it so I can pass it on to train a model?

Gayatri

1 Answer


Try FeatureHasher. It

is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and situations where memory is tight, e.g. when running prediction code on embedded devices.
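A minimal sketch of what this looks like in practice (the column values shown are made-up placeholders): FeatureHasher maps each string through a hash function into a fixed number of sparse columns, so memory use is bounded by n_features rather than by the 300K unique values.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash the high-cardinality string column into a fixed number of
# sparse output columns, instead of one column per unique value.
hasher = FeatureHasher(n_features=2**18, input_type="string")

# With input_type="string", each sample is an iterable of strings;
# here each row carries one categorical value (hypothetical IDs).
rows = [["user_000001"], ["user_174233"], ["user_000001"]]

X = hasher.transform(rows)  # scipy.sparse matrix, shape (3, 2**18)
print(X.shape)
```

The output dimensionality is fixed up front, so identical input strings always land in the same columns, and unseen values at prediction time hash cleanly with no vocabulary to maintain. The trade-off is possible hash collisions (mitigated by a larger n_features) and the loss of an inverse mapping from columns back to original values.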

Ilya Kolpakov