I am vectorizing some features in sklearn, and I have run into a problem. DictVectorizer works well if each item can be encoded with exactly one value per dict key. But what if an item can have two or more values for the same column? For instance, DictVectorizer works fine on an item like this one:
{'a': 'b', 'b': 'c'}
But what about something like this, with more than one value per column?
{'a': ['b','c'], 'b': 'd'}
The one-hot encoding strategy still applies here: you simply want two "a" columns, a=b and a=c. So far as I can tell, no such vectorizer exists! What is one supposed to do in this situation? Do I need to create my own MultiDictVectorizer?
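To make the idea concrete, here is a rough sketch of the kind of workaround I have in mind: flatten each list-valued entry into separate "key=value" entries with a value of 1, then hand the result to a plain DictVectorizer. The names expand_multivalued and vectorize are hypothetical helpers of my own, not part of sklearn, and I am assuming the multi-valued entries are always lists, tuples, or sets of strings.

```python
def expand_multivalued(item, sep="="):
    """Flatten list-like values into one 'key=value' entry per value.

    Hypothetical helper, not part of scikit-learn. Single-valued
    entries are passed through unchanged.
    """
    expanded = {}
    for key, value in item.items():
        if isinstance(value, (list, tuple, set)):
            for v in value:
                # Each value becomes its own indicator feature, e.g. 'a=b': 1
                expanded[f"{key}{sep}{v}"] = 1
        else:
            expanded[key] = value
    return expanded


def vectorize(items):
    """Expand every item, then one-hot encode with a plain DictVectorizer."""
    # Imported here so the helper above stays usable without scikit-learn.
    from sklearn.feature_extraction import DictVectorizer

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform([expand_multivalued(item) for item in items])
    return X, vec.feature_names_


flat = expand_multivalued({'a': ['b', 'c'], 'b': 'd'})
# flat == {'a=b': 1, 'a=c': 1, 'b': 'd'}
```

Because DictVectorizer already names string-valued features as "key=value", the expanded keys should line up with the columns produced for ordinary single-valued items, so a=b from a list and a=b from a plain string would share one column.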
I wrote about this in a blog post here, before posting.