
I am trying to run a machine learning problem using scikit-learn on a dataset where one of the columns (features) has high cardinality: around 300K unique values. How do I vectorize such a feature? Using DictVectorizer is not a solution, as the machine runs out of memory.

I have read in a few posts that I could just assign numbers to those string values, but that would lead to misleading results.

Has anyone dealt with this kind of feature set? If so, how do I vectorize it so I can pass it on to train a model?

Gayatri

1 Answer


Try FeatureHasher. It

is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and situations where memory is tight, e.g. when running prediction code on embedded devices.
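A minimal sketch of what this looks like in practice (the column values shown are made-up placeholders): FeatureHasher maps each string through a hash function into a fixed number of sparse columns, so memory use is bounded by n_features rather than by the 300K unique values.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash the high-cardinality string column into a fixed number of
# sparse output columns, instead of one column per unique value.
hasher = FeatureHasher(n_features=2**18, input_type="string")

# With input_type="string", each sample is an iterable of strings;
# here each row carries one categorical value (hypothetical IDs).
rows = [["user_000001"], ["user_174233"], ["user_000001"]]

X = hasher.transform(rows)  # scipy.sparse matrix, shape (3, 2**18)
print(X.shape)
```

The output dimensionality is fixed up front, so identical input strings always land in the same columns, and unseen values at prediction time hash cleanly with no vocabulary to maintain. The trade-off is possible hash collisions (mitigated by a larger n_features) and the loss of an inverse mapping from columns back to original values.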

Ilya Kolpakov