
It's not a new question; the references I found (first and second) had no solution that worked for me. I'm a newbie to PyTorch, facing AttributeError: 'Field' object has no attribute 'vocab' while creating batches of text data in PyTorch using torchtext.

Following the book Deep Learning with PyTorch, I wrote the same example as explained there.

Here's the snippet:

from torchtext import data
from torchtext import datasets
from torchtext.vocab import GloVe

TEXT = data.Field(lower=True, batch_first=True, fix_length=20)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.splits(TEXT, LABEL)

print("train.fields:", train.fields)
print()
print(vars(train[0]))  # prints the object



TEXT.build_vocab(train, vectors=GloVe(name="6B", dim=300),
                 max_size=10000, min_freq=10)

# VOCABULARY
# print(TEXT.vocab.freqs)  # freq
# print(TEXT.vocab.vectors)  # vectors
# print(TEXT.vocab.stoi)  # Index

train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=128, device=-1, shuffle=True, repeat=False)  # -1 for cpu, None for gpu

# Not working (FROM BOOK)
# batch = next(iter(train_iter))

# print(batch.text)
# print()
# print(batch.label)

# Also not working (from the second solution)
for i in train_iter:
    print (i.text)
    print (i.label)

Here's the stacktrace:

AttributeError                            Traceback (most recent call last)
<ipython-input-33-433ec3a2ca3c> in <module>()
      7 
      8 
----> 9 for i in train_iter:
     10     print (i.text)
     11     print (i.label)

/anaconda3/lib/python3.6/site-packages/torchtext/data/iterator.py in __iter__(self)
    155                     else:
    156                         minibatch.sort(key=self.sort_key, reverse=True)
--> 157                 yield Batch(minibatch, self.dataset, self.device)
    158             if not self.repeat:
    159                 return

/anaconda3/lib/python3.6/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)
     32                 if field is not None:
     33                     batch = [getattr(x, name) for x in data]
---> 34                     setattr(self, name, field.process(batch, device=device))
     35 
     36     @classmethod

/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py in process(self, batch, device)
    199         """
    200         padded = self.pad(batch)
--> 201         tensor = self.numericalize(padded, device=device)
    202         return tensor
    203 

/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py in numericalize(self, arr, device)
    300                 arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
    301             else:
--> 302                 arr = [self.vocab.stoi[x] for x in arr]
    303 
    304             if self.postprocessing is not None:

/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py in <listcomp>(.0)
    300                 arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
    301             else:
--> 302                 arr = [self.vocab.stoi[x] for x in arr]
    303 
    304             if self.postprocessing is not None:

AttributeError: 'Field' object has no attribute 'vocab'

If not using BucketIterator, what else can I use to get a similar output?

Asif Ali

1 Answer


You haven't built the vocab for the LABEL field.

After TEXT.build_vocab(train, ...), run LABEL.build_vocab(train), and the rest will run.
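For intuition: `build_vocab` counts the values a field takes in the dataset and assigns each one a numerical index, stored in the field's `stoi` (string-to-index) mapping, with index 0 conventionally reserved for the unknown token. Here is a minimal pure-Python sketch of what this amounts to for the LABEL field (the label list is illustrative; torchtext's actual `Vocab` additionally handles frequencies, `min_freq`, `max_size`, and pretrained vectors):

```python
from collections import Counter

# Hypothetical labels as they might appear in the IMDB training examples.
labels = ["pos", "neg", "pos", "pos", "neg"]

# build_vocab (roughly) counts the distinct values, then assigns each one
# a numerical index, reserving index 0 for the unknown token.
counter = Counter(labels)
stoi = {"<unk>": 0}
for label in sorted(counter):
    stoi[label] = len(stoi)

print(stoi)  # {'<unk>': 0, 'neg': 1, 'pos': 2}
```

Without this mapping on the LABEL field, `numericalize` has no `self.vocab` to look indices up in, which is exactly the AttributeError in the stacktrace.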

Proyag
  • Why do we have to build vocab for LABEL field? It only contains label. I really don't get it. – beepbeep Jul 22 '20 at 14:58
  • The vocab is needed to map to numerical identifiers, which needs to be done not only for text tokens, but also for the labels – Proyag Jul 22 '20 at 15:53
  • isn't a vocab a map between unique words and IDs that represent them? example objects do have both text and label attributes so arent labels already mapped? I really can't wrap my head around this build_vocab thing, do you maybe know a tutorial that explains this thoroughly? – beepbeep Jul 22 '20 at 19:58
  • Correct, it is a mapping from unique words (or tokens or labels or subwords etc.) to numerical identifiers. In this particular example for the IMDB dataset, the label is "negative" or "positive" for each example, which will simply be mapped to 1 and 2. In other scenarios, you might have more labels, and thus a bigger vocab. Try `print(LABEL.vocab.stoi)` after building the vocab and you'll see a dict that looks like `{'<unk>': 0, 'neg': 1, 'pos': 2}` – Proyag Jul 23 '20 at 09:32
  • In this https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/ tutorial, although the target variable was already numerical he used LABEL.build_vocab() the output of LABEL.vocab.stoi is so weird xD – beepbeep Jul 23 '20 at 20:12
  • If the data is already numerical, you might want to just set `Field.use_vocab` as False – Proyag Jul 29 '20 at 08:22
  • Hi @Proyag : my case is slightly different. I got this error: AttributeError: 'Batch' object has no attribute 'text' even after I do TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE, vectors = "glove.6B.100d", unk_init = torch.Tensor.normal_) LABEL.build_vocab(train_data) any idea why? – chandra sutrisno Sep 20 '20 at 06:12