
I just started using PyTorch for NLP. I found a tutorial that uses from keras.preprocessing.text import one_hot and converts text to a one-hot representation given a vocabulary size.

For example:

The input is:

from keras.preprocessing.text import one_hot

vocab_size = 10000
sentence = ['the glass of milk',
            'the cup of tea',
            'I am a good boy']

onehot_repr = [one_hot(words, vocab_size) for words in sentence]

The output is"

[[6654, 998, 8896, 1609], [6654, 998, 1345, 879], [123, 7653, 1, 5678, 7890]]

How can I perform the same procedure in PyTorch and get output like the above?


1 Answer


PyTorch fundamentally works with tensors and is not designed to work with strings. You can, however, use scikit-learn's LabelEncoder to encode your words as integer indices:

from sklearn import preprocessing

# `sentence` is the same list of strings as in the question
le = preprocessing.LabelEncoder()
le.fit([w for s in sentence for w in s.split()])  # learn an integer id for every unique word

onehot_repr = [le.transform(s.split()) for s in sentence]
>>> [array([10,  5,  8,  7]), array([10,  4,  8,  9]), array([0, 2, 1, 6, 3])]
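
If you then want these indices as PyTorch tensors (e.g. to feed an nn.Embedding layer), or as actual one-hot vectors, here is a minimal sketch building on the same LabelEncoder. It assumes the sentence list from the question and uses torch.nn.functional.one_hot for the expanded vectors:

import torch
from sklearn import preprocessing

sentence = ['the glass of milk',
            'the cup of tea',
            'I am a good boy']

le = preprocessing.LabelEncoder()
le.fit([w for s in sentence for w in s.split()])

# Integer index sequences as LongTensors (sentences have different lengths,
# so keep them in a list rather than stacking into one tensor)
index_repr = [torch.tensor(le.transform(s.split()), dtype=torch.long)
              for s in sentence]

# Expand to true one-hot vectors if you need them, one row per word
onehot_vectors = [torch.nn.functional.one_hot(t, num_classes=len(le.classes_))
                  for t in index_repr]

Note that Keras' one_hot uses the hashing trick against a fixed vocab_size, while LabelEncoder learns an explicit vocabulary, so the particular integers will differ even though both give you one index per word.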