
I am using AllenNLP to train a hierarchical attention network model. My training dataset is a list of JSON objects, where each object in the list has the keys ["text", "label"]. The value associated with the "text" key is a list of lists, e.g.:

[{"text":[["i", "feel", "sad"], ["not", "sure", "i", "guess", "the", "weather"]], "label":0} ... {"text":[[str]], "label":int}] 

My DatasetReader class looks like:

import json
from typing import Dict, Iterator, Optional

from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import LabelField, ListField, TextField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenIndexer
from allennlp.data.tokenizers import Token


@DatasetReader.register("my_reader")
class TranscriptDataReader(DatasetReader):
    def __init__(self,
                 token_indexers: Optional[Dict[str, TokenIndexer]] = None,
                 lazy: bool = True) -> None:
        super().__init__(lazy)
        self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path, 'r') as f:
            data = json.load(f)  # the file holds one JSON array of objects
            for data_json in data:
                # one TextField per segment, wrapped in a ListField below
                sent_list = []
                for segment in data_json["text"]:
                    sent_list.append(self.get_text_field(segment))
                yield self.create_instance(sent_list, str(data_json["label"]))

    def get_text_field(self, segment):
        return TextField([Token(token.lower()) for token in segment], self._token_indexers)

    def create_instance(self, sent_list, label):
        label_field = LabelField(label, skip_indexing=False)
        fields = {'tokens': ListField(sent_list), 'label': label_field}
        return Instance(fields)

and in my config file, I have:

{
  dataset_reader: {
    type: 'my_reader',
  },

  train_data_path: 'data/train.json',
  validation_data_path: 'data/dev.json',

  data_loader: {
    batch_sampler: {
      type: 'bucket',
      batch_size: 10
    }
  },

I have tried setting the lazy param for the dataset reader to both True and False:

  • When set to True, the model is able to train; however, I observe that only one train and one dev instance actually get loaded, even though my dataset contains ~100.
  • When set to False, I've tried modifying the yield line in _read to a return; this causes a type error in the base vocabulary class. I've also tried keeping the yield as is; in that case, no instances get loaded at all, and since the set of instances is empty, the vocabulary never gets instantiated and the embedding class throws an error.

Would appreciate pointers and/or tips for debugging.

2 Answers


If you are using allennlp>=2.0.0, the lazy parameter no longer exists in the DatasetReader constructor. Your super().__init__(lazy) call therefore passes lazy positionally into the new first constructor parameter, max_instances: max_instances=True is equivalent to max_instances=1, which is exactly why only a single instance gets loaded.
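
A minimal sketch of the fix, assuming allennlp>=2.0: drop the lazy argument and forward any remaining keyword arguments to the base class (in 2.x, lazy behavior is configured on the data loader, e.g. via max_instances_in_memory, rather than on the reader):

@DatasetReader.register("my_reader")
class TranscriptDataReader(DatasetReader):
    def __init__(self,
                 token_indexers: Optional[Dict[str, TokenIndexer]] = None,
                 **kwargs) -> None:
        # Do NOT pass `lazy` positionally: in allennlp>=2.0 the first
        # constructor parameter is `max_instances`, so super().__init__(True)
        # silently caps the dataset at a single instance.
        super().__init__(**kwargs)
        self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}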

louixp

Can you print and tell us how many instances are getting loaded after reading the JSON file? (I added a print statement below for clarity.)

def _read(self, file_path: str) -> Iterator[Instance]:
    with open(file_path, 'r') as f:
        data = json.load(f)
        print(len(data))  # how many JSON objects were actually parsed?
        for data_json in data:
            sent_list = []
            for segment in data_json["text"]:
                sent_list.append(self.get_text_field(segment))
            yield self.create_instance(sent_list, str(data_json["label"]))
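
If the printed length looks right but the trainer still sees too few instances, a quick standalone check (a sketch, assuming the reader class and the data path from the question) is to call the reader directly outside the training command and count what comes back:

reader = TranscriptDataReader()
instances = list(reader.read('data/train.json'))  # read() yields instances; force them all
print(len(instances))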
Nakamura