
I am trying to train a model for NMT on a custom dataset. I found this great tutorial on YouTube, along with the accompanying repo, but it uses old versions of PyTorch and torchtext. More recent versions of torchtext have removed the Field and BucketIterator classes.

I looked for more recent tutorials. The closest thing I could find was this Medium post (again with the accompanying code), which works with a custom dataset, but for text classification rather than translation. I tried to adapt the code to my problem and got this far:

from os import PathLike
from torch.utils.data import Dataset
from torchtext.vocab import Vocab
import pandas as pd
from .create_vocab import tokenizer


class ParallelCorpus(Dataset):
    """A parallel corpus for training a machine translation model"""

    def __init__(self,
                 corpus_path: str | PathLike,
                 source_vocab: Vocab,
                 target_vocab: Vocab
                 ):
        super().__init__()
        self.corpus = pd.read_csv(corpus_path)
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab

    def __len__(self):
        return len(self.corpus)

    def __getitem__(self, index: int):
        # Column 0 holds the source sentence; wrap its token
        # indices in <sos>/<eos> markers.
        source_sentence = self.corpus.iloc[index, 0]
        source = [self.source_vocab["<sos>"]]
        source.extend(
            self.source_vocab.lookup_indices(tokenizer(source_sentence))
        )
        source.append(self.source_vocab["<eos>"])

        # Column 1 holds the target sentence; same treatment.
        target_sentence = self.corpus.iloc[index, 1]
        target = [self.target_vocab["<sos>"]]
        target.extend(
            self.target_vocab.lookup_indices(tokenizer(target_sentence))
        )
        target.append(self.target_vocab["<eos>"])

        return source, target
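Since `__getitem__` returns variable-length index lists, I am planning to pad each batch with a collate function, roughly like this (just a sketch; `PAD_IDX` is a placeholder for the index of my actual padding token):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 0  # placeholder: index of "<pad>" in my vocabs


def collate_batch(batch):
    """Pad the source and target sequences in a batch to equal length."""
    sources, targets = zip(*batch)
    sources = [torch.tensor(s, dtype=torch.long) for s in sources]
    targets = [torch.tensor(t, dtype=torch.long) for t in targets]
    # pad_sequence stacks variable-length 1-D tensors into a
    # (max_len, batch_size) tensor, filling gaps with PAD_IDX
    return (pad_sequence(sources, padding_value=PAD_IDX),
            pad_sequence(targets, padding_value=PAD_IDX))


# usage sketch:
# loader = DataLoader(dataset, batch_size=32, shuffle=True,
#                     collate_fn=collate_batch)
```
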

My question is: is this the correct way to implement a parallel corpus for PyTorch? And where can I find more information about this, since the documentation wasn't much help?

Thank you in advance and sorry if this is against the rules.
