4

I am vectorizing some features in sklearn and have run into a problem. DictVectorizer works well if your data can be encoded with one value per dict key. What if an item can have two or more values for the same column? For instance, DictVectorizer works fine on an item like this one:

{'a': 'b', 'b': 'c'}

But what about something like this, with more than one value per column?

{'a': ['b','c'], 'b': 'd'}

The one-hot encoding strategy still applies: you simply want two a columns, a=b and a=c. As far as I can tell, no such vectorizer exists. What is one supposed to do in this situation? Do I need to create my own MultiDictVectorizer?

I wrote about this in a blog post here, before posting.

Greg Reda
rjurney

2 Answers

1

There are at least two quick possible solutions to this situation:

  1. Create a new value that represents the combination of the aggregated values:

    {'a': 'bc', 'b': 'd'} (or give it another name, e.g. 'bc' --> 'e')

  2. Replicate the sample, taking one of the values each time:

    {'a': 'b', 'b': 'd'} and {'a': 'c', 'b': 'd'}

Which one is appropriate depends a lot on the context of your problem. For case 2: is it correct to 'duplicate' a sample with different manifestations? For case 1: is a new value of the feature conceptually acceptable? I also don't know whether the multi-valued feature corresponds to an N/A situation, for example.
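As a rough illustration, the two options could be sketched in plain Python as below. The helper names `merge_values` and `replicate_sample` are made up for this example, and sorting the values before joining them is an assumption so that ['b', 'c'] and ['c', 'b'] end up as the same category. Either result contains only single-valued dicts, which is the shape DictVectorizer expects.

```python
from itertools import product


def merge_values(item, sep=""):
    """Option 1: join a multi-valued entry into one combined category."""
    return {
        key: sep.join(sorted(value)) if isinstance(value, (list, tuple)) else value
        for key, value in item.items()
    }


def replicate_sample(item):
    """Option 2: emit one single-valued copy of the item per combination of values."""
    keys = list(item)
    # Wrap scalars in a one-element list so every key contributes a list of options.
    options = [v if isinstance(v, (list, tuple)) else [v] for v in item.values()]
    return [dict(zip(keys, combo)) for combo in product(*options)]


item = {'a': ['b', 'c'], 'b': 'd'}
merged = merge_values(item)      # {'a': 'bc', 'b': 'd'}
copies = replicate_sample(item)  # [{'a': 'b', 'b': 'd'}, {'a': 'c', 'b': 'd'}]
```

Note that option 2 changes the number of samples, so any labels or sample weights would have to be replicated in the same way.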

I've seen your GitHub proposal, so I understand this is not exactly what you want, but I'm posting it in case it saves you the effort.

Guiem Bosch
0

DictVectorizer can't handle multiple values per key, so I am adding this ability to it. If the pull request is accepted, this will become part of sklearn. If not, I will subclass DictVectorizer as MultiDictVectorizer and release a package for this class.
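In the meantime, one possible workaround is to expand list-valued entries into indicator keys before vectorizing. The sketch below is not the code from the pull request; `expand_multi_values` is a hypothetical helper, and it relies on DictVectorizer's default "=" separator so that an expanded key such as 'a=b' lands in the same column as a plain {'a': 'b'} entry.

```python
from sklearn.feature_extraction import DictVectorizer


def expand_multi_values(item):
    """Turn list-valued entries into one indicator key per value (hypothetical helper)."""
    expanded = {}
    for key, value in item.items():
        if isinstance(value, (list, tuple, set)):
            # {'a': ['b', 'c']} becomes {'a=b': 1, 'a=c': 1}
            for v in value:
                expanded[f"{key}={v}"] = 1
        else:
            # Single values are left for DictVectorizer to one-hot encode itself.
            expanded[key] = value
    return expanded


items = [{'a': ['b', 'c'], 'b': 'd'}, {'a': 'b', 'b': 'd'}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform([expand_multi_values(it) for it in items])
names = list(vec.feature_names_)  # ['a=b', 'a=c', 'b=d']
```

Here the first row of X has a 1 in both the a=b and a=c columns, while the second row has a 1 only in a=b.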

Pull request on GitHub

Issue in the sklearn GitHub project

rjurney