
How can I one-hot encode text data in PyTorch?

For numeric data, you can do this:

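import torch
import torch.nn.functional as F  # one_hot lives in torch.nn.functional

t = torch.tensor([6, 6, 7, 8, 6, 1, 7], dtype=torch.int64)
# one_hot takes the index tensor as its first positional argument
one_hot_vector = F.one_hot(t, num_classes=9)
print(one_hot_vector.shape)
# Out > torch.Size([7, 9])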

But what if you have text data instead?

from torchtext.data.utils import get_tokenizer
corpus = ["The cat sat the mat", "The dog ate my homework"]
tokenizer = get_tokenizer("basic_english")
tokens = [tokenizer(doc) for doc in corpus]

How do I one-hot encode this vocabulary using PyTorch?

With something like scikit-learn I could do the following. Is there a similar way to do it in PyTorch?

import numpy as np
import spacy
from spacy.lang.en import English
from sklearn.preprocessing import OneHotEncoder

corpus = ["The cat sat the mat", "The dog ate my homework"]
nlp = English()
tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
# Use token.text so the array holds plain strings rather than spaCy Token objects
tokens = np.array([[token.text for token in tokenizer(doc)] for doc in corpus])
# Note: scikit-learn 1.2 and later name this argument sparse_output
one_hot_encoder = OneHotEncoder(sparse = False)
one_hot_encoded = one_hot_encoder.fit_transform(tokens)
Talha Tayyab
imantha
  • After tokenization, you convert each string into a list of indices, which is similar to your initial example and can be converted to one-hot vectors directly. What error or issue are you getting? – TYZ Feb 16 '22 at 23:04

1 Answer


You can do the following:

from typing import Union, Iterable

import torch
import torch.nn.functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

corpus = ["The cat sat the mat", "The dog ate my homework"]
tokenizer = get_tokenizer("basic_english")
tokens = [tokenizer(doc) for doc in corpus]

# Build a vocabulary that maps every token in the corpus to an integer index
voc = build_vocab_from_iterator(tokens)

def my_one_hot(voc, keys: Union[str, Iterable[str]]):
    # Accept a single token as well as any iterable of tokens
    if isinstance(keys, str):
        keys = [keys]
    # voc(keys) looks up the indices; one_hot expands each index to a vector
    return F.one_hot(torch.tensor(voc(keys)), num_classes=len(voc))
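
If torchtext is not available, the same idea (map each token to an integer index, then pass the indices to `F.one_hot`) can be sketched with a plain Python dictionary standing in for the vocabulary. This is only an illustration under stated assumptions, not the torchtext approach itself: the helper names `build_vocab` and `one_hot_tokens` are hypothetical, and lowercasing plus whitespace splitting is a rough stand-in for the `basic_english` tokenizer.

```python
import torch
import torch.nn.functional as F

corpus = ["The cat sat the mat", "The dog ate my homework"]

# Assumption: lowercasing + whitespace splitting stands in for a real tokenizer.
tokens = [doc.lower().split() for doc in corpus]


def build_vocab(tokenized_docs):
    """Hypothetical helper: map each distinct token to an integer index,
    numbered in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot_tokens(vocab, doc_tokens):
    """Hypothetical helper: one-hot encode a list of tokens.
    The result has shape (len(doc_tokens), len(vocab))."""
    indices = torch.tensor([vocab[tok] for tok in doc_tokens], dtype=torch.int64)
    return F.one_hot(indices, num_classes=len(vocab))


vocab = build_vocab(tokens)
encoded = [one_hot_tokens(vocab, doc) for doc in tokens]

print(len(vocab))        # number of distinct tokens in the corpus
print(encoded[0].shape)  # (tokens in the first sentence, vocabulary size)
```

Note that in this sketch a token that is not in the dictionary raises a `KeyError`; a real pipeline would usually reserve an index for unknown tokens.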
aretor
  • Did you forget to include the Union function? or is it some function brought by some library? – imantha Feb 18 '22 at 02:11
  • Ah yes sorry, it is from the Python `typing` library, I used it to better clarify that you can pass either a string or an `Iterable` (list, tuple, etc.), but you can omit it. I have updated the answer accordingly. – aretor Feb 18 '22 at 10:28