
I have a list similar to this:

list = ['Opinion, Journal, Editorial',
        'Opinion, Magazine, Evidence-based',
        'Evidence-based']

where the commas separate categories, e.g. Opinion and Journal are two separate categories. The real list is much larger and has many more possible categories. I would like to use one-hot encoding to transform the list so that it can be used for machine learning. For example, from that list I would like to produce a sparse matrix containing data like:

list = [[1, 1, 1, 0, 0],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1]]

Ideally, I would like to use scikit-learn's one-hot encoder, as I presume this would be the most efficient approach.

In response to the comment from @nbrayns:

The idea is to transform the list of categories from text to a vector whereby, if the item belongs to a category, it is assigned 1, otherwise 0. For the above example, the headings would be:

headings = ['Opinion', 'Journal', 'Editorial', 'Magazine', 'Evidence-based']
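
To make the mapping concrete, here is a plain-Python sketch of the transformation I am describing (for illustration only, not the solution I am after; the variable names are mine):

headings = ['Opinion', 'Journal', 'Editorial', 'Magazine', 'Evidence-based']
rows = ['Opinion, Journal, Editorial',
        'Opinion, Magazine, Evidence-based',
        'Evidence-based']

# A row gets a 1 for every heading that appears among its comma-separated categories.
encoded = [[1 if h in row.split(', ') else 0 for h in headings] for row in rows]
print(encoded)
# [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1], [0, 0, 0, 0, 1]]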
user7347576

4 Answers


If you are able to use Pandas, this functionality is essentially built-in there:

import pandas as pd

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']
pd.Series(l).str.get_dummies(', ')
   Editorial  Evidence-based  Journal  Magazine  Opinion
0          1               0        1         0        1
1          0               1        0         1        1
2          0               1        0         0        0

If you'd like to stick with the sklearn ecosystem, you are looking for MultiLabelBinarizer, not OneHotEncoder. As the name implies, OneHotEncoder supports only one level per sample per feature, while your samples carry multiple labels.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()  # pass sparse_output=True if you'd like
mlb.fit_transform(s.split(', ') for s in l)
[[1 0 1 0 1]
 [0 1 0 1 1]
 [0 1 0 0 0]]

To map the columns back to categorical levels, you can access mlb.classes_. For the above example, this gives ['Editorial' 'Evidence-based' 'Journal' 'Magazine' 'Opinion'].
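
If you also want labelled columns like in the pandas output above, one option (a small sketch, assuming pandas is available alongside sklearn) is to wrap the binarized array in a DataFrame using mlb.classes_ as the column names:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(s.split(', ') for s in l)

# Attach the recovered category names as column labels
df = pd.DataFrame(encoded, columns=mlb.classes_)
print(df)
#    Editorial  Evidence-based  Journal  Magazine  Opinion
# 0          1               0        1         0        1
# 1          0               1        0         1        1
# 2          0               1        0         0        0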

Igor Raush

One more way:

import numpy as np

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

# Get list of unique classes
classes = list(set([j for i in l for j in i.split(', ')]))
=> ['Journal', 'Opinion', 'Editorial', 'Evidence-based', 'Magazine']

# Get indices in the matrix
indices = np.array([[k, classes.index(j)] for k, i in enumerate(l) for j in i.split(', ')])
=> array([[0, 1],
          [0, 0],
          [0, 2],
          [1, 1],
          [1, 4],
          [1, 3],
          [2, 3]])

# Generate output
output = np.zeros((len(l), len(classes)), dtype=int)
output[indices[:, 0], indices[:, 1]]=1
=> array([[ 1,  1,  1,  0,  0],
          [ 0,  1,  0,  1,  1],
          [ 0,  0,  0,  1,  0]])
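
Since the question mentions a sparse matrix, the same (row, column) pairs can also be fed straight into scipy.sparse instead of a dense array. A sketch, assuming scipy is installed and reusing l, classes and indices from above:

from scipy.sparse import csr_matrix

# Each (row, column) pair from `indices` gets the value 1
sparse_output = csr_matrix(
    (np.ones(len(indices), dtype=int), (indices[:, 0], indices[:, 1])),
    shape=(len(l), len(classes)))
print(sparse_output.toarray())
# => same matrix as above, stored in sparse form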
Andrzej Pronobis

This may not be the most efficient method, but it is probably easy to grasp.
If you don't already have a list of all possible words, you need to create one first; in the code below it is called unique. The columns of the output matrix s then correspond to those unique words; the rows correspond to the items from the list.

import numpy as np

lis = ['Opinion, Journal, Editorial','Opinion, Magazine, Evidence-based','Evidence-based']

unique = list(set(", ".join(lis).split(", ")))
print(unique)
# prints ['Opinion', 'Journal', 'Magazine', 'Editorial', 'Evidence-based']

s = np.zeros((len(lis), len(unique)))
for i, item in enumerate(lis):
    for j, notion in enumerate(unique):
        if notion in item:
            s[i, j] = 1

print(s)
# prints [[ 1.  1.  0.  1.  0.]
#         [ 1.  0.  1.  0.  1.]
#         [ 0.  0.  0.  0.  1.]]
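
One caveat worth adding (my note, not part of the original approach): notion in item is a substring test, so it would also match if one category name happened to be contained in another (say, a hypothetical 'Journal' vs. 'Journalism'). Splitting each item first makes the test exact; this drop-in replacement for the loop above illustrates the idea:

for i, item in enumerate(lis):
    categories = item.split(", ")      # exact category names for this row
    for j, notion in enumerate(unique):
        if notion in categories:       # list membership, not substring match
            s[i, j] = 1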
ImportanceOfBeingErnest

Very easy in pandas:

import pandas as pd
s = pd.Series(['a','b','c'])
pd.get_dummies(s)

Output:

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
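
Note that pd.get_dummies on the raw strings from the question would treat each full string ('Opinion, Journal, Editorial', ...) as a single category, so the comma-separated labels need to be split first. One way to adapt it (a sketch, assuming pandas >= 0.25 for Series.explode) is:

import pandas as pd

l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']

# Split into individual labels, one label per row, then one-hot encode
# and collapse back to one row per original item.
dummies = pd.get_dummies(pd.Series(l).str.split(', ').explode(), dtype=int)
dummies = dummies.groupby(level=0).max()
print(dummies)
#    Editorial  Evidence-based  Journal  Magazine  Opinion
# 0          1               0        1         0        1
# 1          0               1        0         1        1
# 2          0               1        0         0        0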
exp1orer