How do I one-hot encode text data in PyTorch?
For numeric data you could do this:
import torch
import torch.nn.functional as F  # one_hot lives in torch.nn.functional, not torch.functional

t = torch.tensor([6, 6, 7, 8, 6, 1, 7], dtype=torch.int64)
one_hot_vector = F.one_hot(t, num_classes=9)
print(one_hot_vector.shape)
# Out > torch.Size([7, 9])
But what if you have text data instead?
from torchtext.data.utils import get_tokenizer

corpus = ["The cat sat the mat", "The dog ate my homework"]
tokenizer = get_tokenizer("basic_english")
tokens = [tokenizer(doc) for doc in corpus]
# tokens: [['the', 'cat', 'sat', 'the', 'mat'], ['the', 'dog', 'ate', 'my', 'homework']]
But how do I one-hot encode this vocabulary using PyTorch?
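Here is what I have pieced together so far with torchtext's vocab helpers (a sketch, assuming a torchtext version that provides build_vocab_from_iterator and callable Vocab objects); I am not sure this is the idiomatic way:

import torch
import torch.nn.functional as F
from torchtext.vocab import build_vocab_from_iterator

# build a token -> index mapping over the whole tokenized corpus
vocab = build_vocab_from_iterator(tokens)

# encode each document as a (num_tokens, vocab_size) one-hot matrix
one_hot_docs = [
    F.one_hot(torch.tensor(vocab(doc), dtype=torch.int64), num_classes=len(vocab))
    for doc in tokens
]
print(one_hot_docs[0].shape)
# Out > torch.Size([5, 8]) -- 5 tokens in the first document, 8 unique tokens overall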
For comparison, with scikit-learn I could do the following. Is there a similar way to do it in PyTorch?
import numpy as np
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from sklearn.preprocessing import OneHotEncoder

corpus = ["The cat sat the mat", "The dog ate my homework"]
nlp = English()
tokenizer = Tokenizer(nlp.vocab)
# use token.text so the array holds strings rather than spaCy Token objects
tokens = np.array([[token.text for token in tokenizer(doc)] for doc in corpus])
one_hot_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
one_hot_encoded = one_hot_encoder.fit_transform(tokens)
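For what it's worth, I realize OneHotEncoder treats each token position as a separate categorical column, so this only works because both documents happen to have five tokens. What I actually want in PyTorch is a (num_tokens, vocab_size) one-hot matrix per document, as in the sketch above.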