
I have a list:

code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n', 'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that', 'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']

And I want to convert to one hot encoding. I tried:

to_categorical(code)

And I get an error: ValueError: invalid literal for int() with base 10: '<s>'

What am I doing wrong?

Shamoon
    Per the [docs](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical#arguments), the argument for `to_categorical` needs to be a vector of integers, not strings – C.Nivs May 20 '19 at 20:31
  • How can I convert those strings to integers? – Shamoon May 20 '19 at 20:37

4 Answers


Keras only supports one-hot encoding for data that has already been integer-encoded. You can manually integer-encode your strings like so:

Manual encoding

# this integer encoding is purely based on position, you can do this in other ways
integer_mapping = {x: i for i,x in enumerate(code)}

vec = [integer_mapping[word] for word in code]
# vec is
# [0, 1, 2, 3, 16, 5, 6, 22, 8, 22, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
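
Because repeated words keep only their last position, the mapping above leaves unused indices (4, 7 and 9 never appear in `vec`). If you prefer contiguous indices, one variation is to enumerate the unique tokens instead:

# map each unique token to a contiguous index (first-occurrence order)
integer_mapping = {x: i for i, x in enumerate(dict.fromkeys(code))}
vec = [integer_mapping[word] for word in code]
# vec now uses indices 0..22 with no gaps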

Using scikit-learn

from sklearn.preprocessing import LabelEncoder
import numpy as np

code = np.array(code)

label_encoder = LabelEncoder()
vec = label_encoder.fit_transform(code)

# array([ 2,  6,  7,  9, 19,  1, 16,  0, 17,  0,  3, 10,  5, 21, 11, 18, 19,
#         4, 22, 14, 13, 12,  0, 20,  8, 15])

You can now feed this into keras.utils.to_categorical:

from keras.utils import to_categorical

to_categorical(vec)
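
As a quick sanity check (the shape shown assumes the scikit-learn encoding above):

one_hot = to_categorical(vec)

one_hot.shape   # (26, 23): 26 tokens, 23 distinct labels
one_hot[0]      # row for '<s>': a 1.0 at index 2, 0.0 elsewhere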
C.Nivs

Alternatively, use `pandas.get_dummies`:

import pandas as pd

# one column per unique token in the question's list
pd.get_dummies(code)
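
If you need a plain array rather than a DataFrame (for example, to feed a Keras model), a minimal follow-up:

one_hot = pd.get_dummies(code).to_numpy()   # shape (26, 23) for the question's list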
Abhishek Bhagate
unknown

tf.keras.layers.CategoryEncoding

In TF 2.6.0, one-hot encoding (OHE) or multi-hot encoding (MHE) can be implemented using `tf.keras.layers.CategoryEncoding`, `tf.keras.layers.StringLookup`, and `tf.keras.layers.IntegerLookup`.

This approach does not appear to be available in TF 2.4.x, so it must have been added later.
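
For the list in the question, a minimal sketch (assuming TF >= 2.6; `StringLookup` reserves one extra out-of-vocabulary slot by default):

import tensorflow as tf

lookup = tf.keras.layers.StringLookup(vocabulary=sorted(set(code)))
onehot = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(),
                                          output_mode='one_hot')

encoded = onehot(lookup(code))   # shape (26, 24): 23 unique tokens + 1 OOV slot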


See Classify structured data using Keras preprocessing layers for the actual implementation.

from tensorflow.keras import layers


def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a layer that turns strings into integer indices.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  # Otherwise, create a layer that turns integer values into integer indices.
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Prepare a `tf.data.Dataset` that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Encode the integer indices.
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Apply multi-hot encoding to the indices. The lambda captures the lookup and
  # encoding layers, so you can reuse them or include them in a Keras functional model later.
  return lambda feature: encoder(index(feature))
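
A minimal usage sketch (the feature name 'word' and the toy labels are made up for illustration; the function expects a `(features, label)` `tf.data.Dataset` shaped like the one in the tutorial):

import tensorflow as tf

words = [['<s>'], ['are'], ['defined'], ['in'], ['the']]   # shape (5, 1), one column per feature
ds = tf.data.Dataset.from_tensor_slices(({'word': words}, [0, 1, 0, 1, 0])).batch(2)

encode_word = get_category_encoding_layer('word', ds, dtype='string')

[(features, labels)] = ds.take(1)
encode_word(features['word'])   # shape (2, 6): batch of 2, 5 words + 1 OOV slot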
mon

Try converting it to a numpy array first:

from numpy import array

and then:

to_categorical(array(code))

mattemyo