
I have a list:

code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n', 'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that', 'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']

And I want to convert to one hot encoding. I tried:

to_categorical(code)

And I get an error: ValueError: invalid literal for int() with base 10: '<s>'

What am I doing wrong?

Shamoon
    Per the [docs](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical#arguments), the argument for `to_categorical` needs to be a vector of integers, not strings – C.Nivs May 20 '19 at 20:31
  • How can I convert those strings to integers? – Shamoon May 20 '19 at 20:37

4 Answers


Keras only supports one-hot encoding for data that has already been integer-encoded. You can manually integer-encode your strings like so:

Manual encoding

# this integer encoding is purely based on position, you can do this in other ways
integer_mapping = {x: i for i,x in enumerate(code)}

vec = [integer_mapping[word] for word in code]
# vec is
# [0, 1, 2, 3, 16, 5, 6, 22, 8, 22, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
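
Because repeated words keep only their last position, the mapping above leaves unused indices (4, 7 and 9 never appear in `vec`). If you prefer contiguous indices, one variation is to enumerate the unique tokens instead:

# map each unique token to a contiguous index (first-occurrence order)
integer_mapping = {x: i for i, x in enumerate(dict.fromkeys(code))}
vec = [integer_mapping[word] for word in code]
# vec now uses indices 0..22 with no gaps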

Using scikit-learn

from sklearn.preprocessing import LabelEncoder
import numpy as np

code = np.array(code)

label_encoder = LabelEncoder()
vec = label_encoder.fit_transform(code)

# array([ 2,  6,  7,  9, 19,  1, 16,  0, 17,  0,  3, 10,  5, 21, 11, 18, 19,
#         4, 22, 14, 13, 12,  0, 20,  8, 15])

You can now feed this into keras.utils.to_categorical:

from keras.utils import to_categorical

to_categorical(vec)
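
As a quick sanity check (the shape shown assumes the scikit-learn encoding above):

one_hot = to_categorical(vec)

one_hot.shape   # (26, 23): 26 tokens, 23 distinct labels
one_hot[0]      # row for '<s>': a 1.0 at index 2, 0.0 elsewhere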
C.Nivs

Alternatively, use `pandas.get_dummies`:

import pandas as pd

# one column per unique token in the question's list
pd.get_dummies(code)
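
If you need a plain array rather than a DataFrame (for example, to feed a Keras model), a minimal follow-up:

one_hot = pd.get_dummies(code).to_numpy()   # shape (26, 23) for the question's list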
Abhishek Bhagate
unknown

tf.keras.layers.CategoryEncoding

In TF 2.6.0, one-hot encoding (OHE) or multi-hot encoding (MHE) can be implemented using `tf.keras.layers.CategoryEncoding`, `tf.keras.layers.StringLookup`, and `tf.keras.layers.IntegerLookup`.

This approach does not appear to be available in TF 2.4.x, so it must have been added later.
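
For the list in the question, a minimal sketch (assuming TF >= 2.6; `StringLookup` reserves one extra out-of-vocabulary slot by default):

import tensorflow as tf

lookup = tf.keras.layers.StringLookup(vocabulary=sorted(set(code)))
onehot = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(),
                                          output_mode='one_hot')

encoded = onehot(lookup(code))   # shape (26, 24): 23 unique tokens + 1 OOV slot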


See Classify structured data using Keras preprocessing layers for the actual implementation.

from tensorflow.keras import layers


def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a layer that turns strings into integer indices.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  # Otherwise, create a layer that turns integer values into integer indices.
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Prepare a `tf.data.Dataset` that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Encode the integer indices.
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Apply multi-hot encoding to the indices. The lambda captures the lookup and
  # encoding layers, so you can reuse them or include them in a Keras functional model later.
  return lambda feature: encoder(index(feature))
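
A minimal usage sketch (the feature name 'word' and the toy labels are made up for illustration; the function expects a `(features, label)` `tf.data.Dataset` shaped like the one in the tutorial):

import tensorflow as tf

words = [['<s>'], ['are'], ['defined'], ['in'], ['the']]   # shape (5, 1), one column per feature
ds = tf.data.Dataset.from_tensor_slices(({'word': words}, [0, 1, 0, 1, 0])).batch(2)

encode_word = get_category_encoding_layer('word', ds, dtype='string')

[(features, labels)] = ds.take(1)
encode_word(features['word'])   # shape (2, 6): batch of 2, 5 words + 1 OOV slot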
mon

Try converting it to a numpy array first:

from numpy import array

and then:

to_categorical(array(code))

mattemyo