I am trying to solve a machine learning problem with scikit-learn, and one of the columns (features) in my dataset has high cardinality: around 300K unique values. How do I vectorize such a feature? DictVectorizer is not an option because the machine runs out of memory.
I have read in a few posts that I could simply assign a number to each of those string values, but that doing so would lead to misleading results.
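To illustrate the concern as I understand it, here is a minimal sketch in plain Python of what "assigning numbers" would look like. The values are made up for illustration and merely stand in for the real column; this is not my actual code.

```python
# Hypothetical values standing in for the real ~300K-category column.
values = ["user_b", "user_a", "user_c", "user_a"]

# Give each distinct string an integer, in sorted order of the strings.
mapping = {v: i for i, v in enumerate(sorted(set(values)))}
codes = [mapping[v] for v in values]

# The integers are arbitrary labels, but a model that treats the column as
# numeric would read them as magnitudes: "user_c" (2) > "user_b" (1) > "user_a" (0).
print(mapping)
print(codes)
```

If I understand correctly, the ordering and distances between these codes carry no real meaning, which is presumably why the posts call the results misleading.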
Has anyone dealt with this kind of feature? If so, how did you vectorize it so that it could be passed on to train a model?