
I am building a dataset for a Seq2Seq model, which requires the data to be in the form of one-hot encoded, padded sequences.

For example, if my sequence contains 'a' (a), then it should generate something like the following (given a maximum sequence length of 4):

[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]

So I tried to first pad the sequences and then one-hot encode the padded result (somewhat answered in this answer).

train_padded_txt_Y1 = to_categorical(pad_sequences(training_txt_Y1, maxlen=max_label_len, padding='post', value=len(char_list)))

However, this produces one-hot encoded padded sequences like the following, in which the padding value is treated as a class to be encoded:

[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
   [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
   [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]

You can see that each row of the generated one-hot encoding has an additional element: the class created for the padding value.
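A minimal NumPy sketch of why the extra column appears. When `num_classes` is not passed, `to_categorical` infers the class count as the largest index plus one, so a padding value of `len(char_list)` gets its own column. The toy sizes and the names `num_real_classes` and `pad_value` are assumptions for illustration only, and `np.eye` indexing stands in for the Keras call.

```python
import numpy as np

# Hypothetical toy data: 3 real classes (indices 0..2), so the padding value is 3,
# mirroring value=len(char_list) in the call above.
num_real_classes = 3
pad_value = num_real_classes
padded = np.array([0, 1, pad_value, pad_value])  # sequence [0, 1] post-padded to length 4

# Class count inferred the same way as to_categorical without num_classes:
# largest index + 1, which includes the padding value.
inferred_classes = int(padded.max()) + 1
one_hot = np.eye(inferred_classes, dtype="float32")[padded]

print(one_hot.shape)  # (4, 4): one column more than the 3 real classes
print(one_hot)
```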

So the question is: can the Keras utilities produce the one-hot encoded padded sequences that I need, or do I have to write a custom implementation?
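One possible approach, sketched here in plain NumPy rather than taken from any answer: keep the padding value as an extra class, one-hot encode with one extra column, then slice that column off so padded steps become all-zero rows. The helper name `one_hot_padded` and the toy inputs are hypothetical, and truncating over-long sequences at the end is an assumption of this sketch.

```python
import numpy as np

def one_hot_padded(sequences, max_len, num_classes):
    """One-hot encode integer sequences; padded steps become all-zero rows."""
    pad_value = num_classes  # extra index reserved for padding, as in the question
    # Post-pad every sequence with pad_value (over-long sequences are cut at the end).
    padded = np.full((len(sequences), max_len), pad_value, dtype="int64")
    for i, seq in enumerate(sequences):
        seq = list(seq)[:max_len]
        padded[i, :len(seq)] = seq
    # Encode with num_classes + 1 columns, then drop the last (padding) column.
    one_hot = np.eye(num_classes + 1, dtype="float32")[padded]
    return one_hot[..., :-1]

# Hypothetical label sequences over 27 classes, padded to length 4.
encoded = one_hot_padded([[0, 1], [2]], max_len=4, num_classes=27)
print(encoded.shape)            # (2, 4, 27)
print(encoded[0].sum(axis=-1))  # [1. 1. 0. 0.]
```

With the Keras utilities, the same effect should be obtainable by slicing the result of the original call: `to_categorical(pad_sequences(...), num_classes=len(char_list) + 1)[..., :-1]`. Passing `num_classes` explicitly matters, because if no sequence happens to need padding the extra column is never inferred and the slice would drop a real class.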

Tayyab
  • What is `char_list`? – Raj Jan 11 '20 at 12:57
  • A list of all the characters that the model can output (assuming every element of the sequence represents a character) – Tayyab Jan 11 '20 at 13:35
  • It's not really clear what you need. Can you add what `training_txt_Y1` and `char_list` are for this expected output? – thushv89 Jan 15 '20 at 03:08
  • @thushv89 the numpy array under: "For Example if my sequence contains 'a' (a), then it should generate something like following (given max sequence size can be 4):", is what I wanted my training_txt_Y1 to be (as stated in the question). And as I already replied in a previous comment, char_list is the list of all characters (a,b,c,d,.....). – Tayyab Jan 16 '20 at 06:30

0 Answers