
I have a list similar to this:

list = ['Opinion, Journal, Editorial',
        'Opinion, Magazine, Evidence-based',
        'Evidence-based']

where the commas separate categories, e.g. Opinion and Journal are two separate categories. The real list is much larger and has many more possible categories. I would like to use one-hot encoding to transform the list so that it can be used for machine learning. For example, from that list I would like to produce a sparse matrix containing data like:

list = [[1, 1, 1, 0, 0],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1]]

Ideally, I would like to use scikit-learn's one-hot encoder, as I presume this would be the most efficient approach.

In response to the comment from @nbrayns:

The idea is to transform the list of categories from text to a vector whereby, if the item belongs to a category, it is assigned 1, otherwise 0. For the above example, the headings would be:

headings = ['Opinion', 'Journal', 'Editorial', 'Magazine', 'Evidence-based']
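
To make the mapping concrete, here is a plain-Python sketch of the transformation I am describing (for illustration only, not the solution I am after; the variable names are mine):

headings = ['Opinion', 'Journal', 'Editorial', 'Magazine', 'Evidence-based']
rows = ['Opinion, Journal, Editorial',
        'Opinion, Magazine, Evidence-based',
        'Evidence-based']

# A row gets a 1 for every heading that appears among its comma-separated categories.
encoded = [[1 if h in row.split(', ') else 0 for h in headings] for row in rows]
print(encoded)
# [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1], [0, 0, 0, 0, 1]]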
user7347576

4 Answers


If you are able to use Pandas, this functionality is essentially built-in there:

import pandas as pd

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']
pd.Series(l).str.get_dummies(', ')
   Editorial  Evidence-based  Journal  Magazine  Opinion
0          1               0        1         0        1
1          0               1        0         1        1
2          0               1        0         0        0

If you'd like to stick with the sklearn ecosystem, you are looking for MultiLabelBinarizer, not OneHotEncoder. As the name implies, OneHotEncoder supports only one level per sample per feature, while your samples carry multiple labels.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()  # pass sparse_output=True if you'd like
mlb.fit_transform(s.split(', ') for s in l)
[[1 0 1 0 1]
 [0 1 0 1 1]
 [0 1 0 0 0]]

To map the columns back to categorical levels, you can access mlb.classes_. For the above example, this gives ['Editorial' 'Evidence-based' 'Journal' 'Magazine' 'Opinion'].
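
If you also want labelled columns like in the pandas output above, one option (a small sketch, assuming pandas is available alongside sklearn) is to wrap the binarized array in a DataFrame using mlb.classes_ as the column names:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(s.split(', ') for s in l)

# Attach the recovered category names as column labels
df = pd.DataFrame(encoded, columns=mlb.classes_)
print(df)
#    Editorial  Evidence-based  Journal  Magazine  Opinion
# 0          1               0        1         0        1
# 1          0               1        0         1        1
# 2          0               1        0         0        0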

Igor Raush

One more way:

import numpy as np

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

# Get list of unique classes
classes = list(set([j for i in l for j in i.split(', ')]))
=> ['Journal', 'Opinion', 'Editorial', 'Evidence-based', 'Magazine']

# Get indices in the matrix
indices = np.array([[k, classes.index(j)] for k, i in enumerate(l) for j in i.split(', ')])
=> array([[0, 1],
          [0, 0],
          [0, 2],
          [1, 1],
          [1, 4],
          [1, 3],
          [2, 3]])

# Generate output
output = np.zeros((len(l), len(classes)), dtype=int)
output[indices[:, 0], indices[:, 1]]=1
=> array([[ 1,  1,  1,  0,  0],
          [ 0,  1,  0,  1,  1],
          [ 0,  0,  0,  1,  0]])
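
Since the question mentions a sparse matrix, the same (row, column) pairs can also be fed straight into scipy.sparse instead of a dense array. A sketch, assuming scipy is installed and reusing l, classes and indices from above:

from scipy.sparse import csr_matrix

# Each (row, column) pair from `indices` gets the value 1
sparse_output = csr_matrix(
    (np.ones(len(indices), dtype=int), (indices[:, 0], indices[:, 1])),
    shape=(len(l), len(classes)))
print(sparse_output.toarray())
# => same matrix as above, stored in sparse form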
Andrzej Pronobis

This may not be the most efficient method, but it is probably easy to grasp.
If you don't already have a list of all possible words, you need to create one first; in the code below it is called unique. The columns of the output matrix s then correspond to those unique words; the rows correspond to the items from the list.

import numpy as np

lis = ['Opinion, Journal, Editorial','Opinion, Magazine, Evidence-based','Evidence-based']

unique = list(set(", ".join(lis).split(", ")))
print(unique)
# prints ['Opinion', 'Journal', 'Magazine', 'Editorial', 'Evidence-based']

s = np.zeros((len(lis), len(unique)))
for i, item in enumerate(lis):
    for j, notion in enumerate(unique):
        if notion in item:
            s[i, j] = 1

print(s)
# prints [[ 1.  1.  0.  1.  0.]
#         [ 1.  0.  1.  0.  1.]
#         [ 0.  0.  0.  0.  1.]]
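
One caveat worth adding (my note, not part of the original approach): notion in item is a substring test, so it would also match if one category name happened to be contained in another (say, a hypothetical 'Journal' vs. 'Journalism'). Splitting each item first makes the test exact; this drop-in replacement for the loop above illustrates the idea:

for i, item in enumerate(lis):
    categories = item.split(", ")      # exact category names for this row
    for j, notion in enumerate(unique):
        if notion in categories:       # list membership, not substring match
            s[i, j] = 1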
ImportanceOfBeingErnest

Very easy in pandas:

import pandas as pd
s = pd.Series(['a','b','c'])
pd.get_dummies(s)

Output:

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
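
Note that pd.get_dummies on the raw strings from the question would treat each full string ('Opinion, Journal, Editorial', ...) as a single category, so the comma-separated labels need to be split first. One way to adapt it (a sketch, assuming pandas >= 0.25 for Series.explode) is:

import pandas as pd

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

# Split into individual labels, one label per row, then one-hot encode
# and collapse back to one row per original item.
dummies = pd.get_dummies(pd.Series(l).str.split(', ').explode(), dtype=int)
dummies = dummies.groupby(level=0).max()
print(dummies)
#    Editorial  Evidence-based  Journal  Magazine  Opinion
# 0          1               0        1         0        1
# 1          0               1        0         1        1
# 2          0               1        0         0        0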
exp1orer