
I have a list of sentences. I want to pad them, but when I use Keras's pad_sequences like this:

from keras.preprocessing.sequence import pad_sequences
s = [["this", "is", "a", "book"], ["this", "is", "not"]]
g = pad_sequences(s, dtype='str', maxlen=10, value='_PAD_')

the result is:

array([['_', '_', '_', '_', '_', '_', 't', 'i', 'a', 'b'],
       ['_', '_', '_', '_', '_', '_', '_', 't', 'i', 'n']], dtype='<U1')

Why is it not working properly?

I want to use this result as the input to an ELMo embedding, so I need string sentences, not integer encodings.

  • Possible duplicate of [Difference in padding integer and string in keras](https://stackoverflow.com/questions/55220072/difference-in-padding-integer-and-string-in-keras). – giser_yugang May 05 '19 at 14:02

2 Answers


Change dtype to object; it will do the job. With dtype='str', NumPy allocates a fixed-width unicode array that resolves to a single character per cell (note the dtype='<U1' in your output), so the pad value and every token are truncated to their first character. dtype=object keeps the full Python strings.
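You can confirm the truncation directly in NumPy, independent of Keras (a quick check; the dtypes match the behavior observed in the question):

import numpy as np

a = np.full((2, 3), '_PAD_', dtype='str')   # bare 'str' resolves to one-char unicode
print(a.dtype)  # <U1 -- every cell holds one character, so '_PAD_' becomes '_'
b = np.full((2, 3), '_PAD_', dtype=object)  # object cells hold full Python strings
print(b.dtype)  # object

With dtype=object the call becomes: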

from keras.preprocessing.sequence import pad_sequences

s = [["this", "is", "a", "book"], ["this", "is", "not"]]
g = pad_sequences(s, dtype=object, maxlen=10, value='_PAD_')
print(g)

Output:

array([['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'this',
        'is', 'a', 'book'],
       ['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_',
        'this', 'is', 'not']], dtype=object)
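Since the goal is to feed these padded tokens to ELMo: the TF-Hub ELMo module's tokens signature accepts exactly this kind of string matrix. A minimal sketch, assuming the TF1-style module at https://tfhub.dev/google/elmo/2 (note that the tokens signature expects post-padding, with the empty string as the conventional pad token and the true lengths passed separately):

import tensorflow as tf
import tensorflow_hub as hub
from keras.preprocessing.sequence import pad_sequences

s = [["this", "is", "a", "book"], ["this", "is", "not"]]
tokens = pad_sequences(s, dtype=object, maxlen=10, padding='post', value='')
seq_len = [len(sent) for sent in s]  # true lengths, excluding padding

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
embeddings = elmo(
    inputs={"tokens": tokens.tolist(), "sequence_len": seq_len},
    signature="tokens",
    as_dict=True,
)["elmo"]  # shape (batch, maxlen, 1024)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)  # (2, 10, 1024)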
John

The text should first be converted into numeric values. Keras provides a Tokenizer class with the methods fit_on_texts and texts_to_sequences to work with text data.

Refer to the Keras preprocessing documentation for details.

Tokenizer: This helps in vectorizing a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary or based on word count.

fit_on_texts: This creates a vocabulary index based on word frequency.

texts_to_sequences: This transforms each text in texts to a sequence of integers.

from keras.preprocessing import text, sequence
s = ["this", "is", "a", "book", "of my choice"]
tokenizer = text.Tokenizer(num_words=100, lower=True)
tokenizer.fit_on_texts(s)                    # build the vocabulary index from the texts
seq_token = tokenizer.texts_to_sequences(s)  # map each text to a sequence of integer ids
g = sequence.pad_sequences(seq_token, maxlen=10)  # pre-pad each sequence with zeros
g

Output

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 5, 6, 7]], dtype=int32)
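Note that this produces integer ids, while the question needs the string tokens themselves for ELMo. If you want to map the ids back to words, the Tokenizer can invert the encoding; a small follow-up using the tokenizer and seq_token from above (the expected outputs follow from the word index implied by the output array):

tokenizer.sequences_to_texts(seq_token)  # invert the ids back to space-joined text
# ['this', 'is', 'a', 'book', 'of my choice']

tokenizer.word_index  # the vocabulary built by fit_on_texts
# {'this': 1, 'is': 2, 'a': 3, 'book': 4, 'of': 5, 'my': 6, 'choice': 7}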
joel