I'm trying to prepare a custom dataset, loaded from a csv file, for a binary text classification problem with torchtext. It's a basic dataset of news headlines, each with a market sentiment label of "positive" or "negative". I've been following some online tutorials on PyTorch to get this far, but the latest torchtext package has changed significantly, so most of that material is out of date.
Below, I've successfully parsed my csv file into a pandas dataframe with two columns (the headline text and a label that is either 0 or 1 for positive/negative), split it into training and test sets, and wrapped each in a PyTorch Dataset class:
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

train, test = train_test_split(eurusd_df, test_size=0.2)

class CustomTextDataset(Dataset):
    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __getitem__(self, idx):
        # Positional lookup with .iloc, since the split shuffles the index
        label = self.labels.iloc[idx]
        text = self.text.iloc[idx]
        # Each sample is a dict with "Label" and "Text" keys
        sample = {"Label": label, "Text": text}
        return sample

    def __len__(self):
        return len(self.labels)

train_dataset = CustomTextDataset(train['Text'], train['Labels'])
test_dataset = CustomTextDataset(test['Text'], test['Labels'])
I'm now trying to build a vocabulary of tokens, following this tutorial https://coderzcolumn.com/tutorials/artificial-intelligence/pytorch-simple-guide-to-text-classification and the official PyTorch tutorial https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html .
However, using the code below
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = train_dataset

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
yields a very small vocabulary, and applying the example vocab(['here', 'is', 'an', 'example']) to a text field taken from the original dataframe returns a list of 0s. That suggests the vocab is being built from the label field, which contains only 0s and 1s, rather than from the text field. Could anyone review this and show me how to build the vocab from the text field?
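One possible explanation, sketched here under stated assumptions rather than as a confirmed diagnosis: because __getitem__ returns a dict, the loop `for _, text in data_iter` unpacks each sample's keys, so `text` would always be the string "Text" rather than a headline. The sketch below is pure Python and does not call torchtext; `samples`, `simple_tokenizer`, and `count_tokens` are hypothetical stand-ins for the dataset, `get_tokenizer('basic_english')`, and the vocab builder, and the headlines are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for CustomTextDataset: each item is a dict with
# "Label" and "Text" keys, mirroring what __getitem__ returns.
samples = [
    {"Label": 1, "Text": "Euro rallies against the dollar"},
    {"Label": 0, "Text": "Dollar slides after weak jobs data"},
]

def simple_tokenizer(text):
    # Assumption: a lowercase whitespace split stands in for the
    # torchtext 'basic_english' tokenizer.
    return text.lower().split()

def yield_tokens_unpacked(data_iter):
    # Same pattern as in the question: unpacking a two-key dict yields
    # its keys, so `text` is the literal string "Text" on every pass.
    for _, text in data_iter:
        yield simple_tokenizer(text)

def yield_tokens_by_key(data_iter):
    # Index each sample by key to reach the headline itself.
    for sample in data_iter:
        yield simple_tokenizer(sample["Text"])

def count_tokens(token_lists):
    # Hypothetical stand-in for a vocab builder: count every token seen.
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return counter

unpacked_counts = count_tokens(yield_tokens_unpacked(samples))
keyed_counts = count_tokens(yield_tokens_by_key(samples))

print(sorted(unpacked_counts))  # only the token derived from the key name
print(sorted(keyed_counts))     # tokens from the actual headlines
```

If that is what is happening, the corresponding change in the original `yield_tokens` would be to loop with `for sample in data_iter:` and call `tokenizer(sample["Text"])`, leaving the `build_vocab_from_iterator` call unchanged.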