
I have a large list of strings. Each string is one example in the training dataset and contains a list of categories separated by commas, e.g.

mesh = ['aligator, dog, cat', 'cat, mouse, aligator', '']

Some examples may not belong to any category, and these are represented as an empty string.

I wish to use one-hot encoding to encode these categories for use in machine learning.

How can I do this? I do not have a complete list of categories and there are approximately 5,000 possible categories.

scutnex

2 Answers


Demo (note: this session used only the two non-empty strings from mesh; an empty string would simply produce an all-zero row):

In [64]: from sklearn.feature_extraction.text import CountVectorizer

In [65]: cv = CountVectorizer()

In [66]: X = cv.fit_transform(mesh)

In [67]: X.A
Out[67]:
array([[1, 1, 1, 0],
       [1, 1, 0, 1]], dtype=int64)

column names:

In [68]: cv.get_feature_names()
Out[68]: ['aligator', 'cat', 'dog', 'mouse']

We can visualize it using pandas.SparseDataFrame:

In [135]: import pandas as pd

In [136]: pd.SparseDataFrame(X, columns=cv.get_feature_names(), default_fill_value=0)
Out[136]:
   aligator  cat  dog  mouse
0         1    1    1      0
1         1    1    0      1
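On recent library versions this session no longer runs as shown: pd.SparseDataFrame was removed from pandas, and cv.get_feature_names() was replaced by get_feature_names_out() in scikit-learn. A more direct fit for comma-separated multi-label data (including the empty-string case) is scikit-learn's MultiLabelBinarizer; a minimal sketch, assuming the mesh list from the question:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mesh = ['aligator, dog, cat', 'cat, mouse, aligator', '']

# Split each comma-separated string into a list of labels;
# an empty string becomes an empty list (no categories).
labels = [[c.strip() for c in s.split(',') if c.strip()] for s in mesh]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(labels)

print(list(mlb.classes_))  # ['aligator', 'cat', 'dog', 'mouse']
print(X)
# [[1 1 1 0]
#  [1 1 0 1]
#  [0 0 0 0]]
```

Unlike CountVectorizer, this treats each category as an atomic label rather than tokenizing it, so multi-word category names survive intact.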
MaxU - stand with Ukraine

There are a bunch of different ways to encode categorical variables for machine learning; a handful of them (including one-hot) are implemented in the scikit-learn-contrib package category_encoders:

https://github.com/scikit-learn-contrib/categorical-encoding

If you're already using scikit-learn and/or pandas, it may be a simple solution. With the very high dimensionality you mention, and since you don't necessarily know all of the categories up front, you may have better luck with something like the HashingEncoder.
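The same hashing idea is also available in scikit-learn itself as FeatureHasher, which works directly on lists of labels. A minimal sketch (using FeatureHasher rather than category_encoders' HashingEncoder, and assuming the mesh list from the question):

```python
from sklearn.feature_extraction import FeatureHasher

mesh = ['aligator, dog, cat', 'cat, mouse, aligator', '']
labels = [[c.strip() for c in s.split(',') if c.strip()] for s in mesh]

# The output width is fixed up front; categories never seen during
# training hash into the same space, so no complete category list
# is needed. Collisions are possible but rare if n_features is large.
hasher = FeatureHasher(n_features=1024, input_type='string')
X = hasher.transform(labels)  # scipy sparse matrix, shape (3, 1024)

print(X.shape)
```

The trade-off is that hashed columns are no longer interpretable as named categories, which matters if you need to inspect feature importances later.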