
I have a large list of strings. Each string is one example in the training dataset and contains a list of categories separated by commas, e.g.

mesh = ['aligator, dog, cat', 'cat, mouse, aligator', '']

Some examples may not belong to any category, and these are represented as an empty string.

I wish to use one-hot encoding to encode these categories for use in machine learning.

How can I do this? I do not have a complete list of categories and there are approximately 5,000 possible categories.

scutnex

2 Answers


Demo (note: this session used only the two non-empty strings from mesh; an empty string would simply produce an all-zero row):

In [64]: from sklearn.feature_extraction.text import CountVectorizer

In [65]: cv = CountVectorizer()

In [66]: X = cv.fit_transform(mesh)

In [67]: X.A
Out[67]:
array([[1, 1, 1, 0],
       [1, 1, 0, 1]], dtype=int64)

column names:

In [68]: cv.get_feature_names()
Out[68]: ['aligator', 'cat', 'dog', 'mouse']

We can visualize it using pandas.SparseDataFrame:

In [135]: import pandas as pd

In [136]: pd.SparseDataFrame(X, columns=cv.get_feature_names(), default_fill_value=0)
Out[136]:
   aligator  cat  dog  mouse
0         1    1    1      0
1         1    1    0      1
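On recent library versions this session no longer runs as shown: pd.SparseDataFrame was removed from pandas, and cv.get_feature_names() was replaced by get_feature_names_out() in scikit-learn. A more direct fit for comma-separated multi-label data (including the empty-string case) is scikit-learn's MultiLabelBinarizer; a minimal sketch, assuming the mesh list from the question:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mesh = ['aligator, dog, cat', 'cat, mouse, aligator', '']

# Split each comma-separated string into a list of labels;
# an empty string becomes an empty list (no categories).
labels = [[c.strip() for c in s.split(',') if c.strip()] for s in mesh]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(labels)

print(list(mlb.classes_))  # ['aligator', 'cat', 'dog', 'mouse']
print(X)
# [[1 1 1 0]
#  [1 1 0 1]
#  [0 0 0 0]]
```

Unlike CountVectorizer, this treats each category as an atomic label rather than tokenizing it, so multi-word category names survive intact.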
MaxU - stand with Ukraine

There are a bunch of different ways to encode categorical variables for machine learning; a handful of them (including one-hot) are implemented in the scikit-learn-contrib package category_encoders:

https://github.com/scikit-learn-contrib/categorical-encoding

If you're already using scikit-learn and/or pandas, it may be a simple solution. With the very high dimensionality you mention, and since you don't necessarily know all of the categories up front, you may have better luck with something like the HashingEncoder.
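The same hashing idea is also available in scikit-learn itself as FeatureHasher, which works directly on lists of labels. A minimal sketch (using FeatureHasher rather than category_encoders' HashingEncoder, and assuming the mesh list from the question):

```python
from sklearn.feature_extraction import FeatureHasher

mesh = ['aligator, dog, cat', 'cat, mouse, aligator', '']
labels = [[c.strip() for c in s.split(',') if c.strip()] for s in mesh]

# The output width is fixed up front; categories never seen during
# training hash into the same space, so no complete category list
# is needed. Collisions are possible but rare if n_features is large.
hasher = FeatureHasher(n_features=1024, input_type='string')
X = hasher.transform(labels)  # scipy sparse matrix, shape (3, 1024)

print(X.shape)
```

The trade-off is that hashed columns are no longer interpretable as named categories, which matters if you need to inspect feature importances later.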