I am using AllenNLP to train a hierarchical attention network model. My training dataset is a list of JSON objects, each with the keys ["text", "label"]. The value associated with the "text" key is a list of lists of tokens, e.g.:
[{"text":[["i", "feel", "sad"], ["not", "sure", "i", "guess", "the", "weather"]], "label":0} ... {"text":[[str]], "label":int}]
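To make the format concrete, here is a minimal, stdlib-only sketch that checks whether parsed JSON has this shape and counts the records. `validate_records` is a hypothetical helper name (not part of AllenNLP), and the sample string simply repeats the first example record above. Comparing the returned count against the number of instances the trainer reports is one way to see where records go missing.

```python
import json
from typing import Any


def validate_records(records: Any) -> int:
    """Check the expected shape and return the number of records.

    Hypothetical helper, not an AllenNLP API. Each record must be a dict
    with a "text" key (a list of lists of strings) and an integer "label".
    """
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list")
    for i, record in enumerate(records):
        if not isinstance(record, dict) or not {"text", "label"} <= set(record):
            raise ValueError(f"record {i}: expected keys 'text' and 'label'")
        if not isinstance(record["label"], int):
            raise ValueError(f"record {i}: 'label' must be an int")
        text = record["text"]
        is_list_of_token_lists = isinstance(text, list) and all(
            isinstance(segment, list)
            and all(isinstance(token, str) for token in segment)
            for segment in text
        )
        if not is_list_of_token_lists:
            raise ValueError(f"record {i}: 'text' must be a list of lists of str")
    return len(records)


# Sample data: the first record from the example above.
sample = json.loads(
    '[{"text": [["i", "feel", "sad"], '
    '["not", "sure", "i", "guess", "the", "weather"]], "label": 0}]'
)
num_records = validate_records(sample)
```

For a real file, the same check would be `validate_records(json.load(open(path)))`, where `path` is whatever file is being debugged.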
My DatasetReader class looks like:
    @DatasetReader.register("my_reader")
    class TranscriptDataReader(DatasetReader):
        def __init__(self,
                     token_indexers: Optional[Dict[str, TokenIndexer]] = None,
                     lazy: bool = True) -> None:
            super().__init__(lazy)
            self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}

        def _read(self, file_path: str) -> Iterator[Instance]:
            with open(file_path, 'r') as f:
                data = json.loads(f.read())
                for _, data_json in enumerate(data):
                    sent_list = []
                    for segment in data_json["text"]:
                        sent_list.append(self.get_text_field(segment))
                    yield self.create_instance(sent_list, str(data_json["label"]))

        def get_text_field(self, segment):
            return TextField([Token(token.lower()) for token in segment], self._token_indexers)

        def create_instance(self, sent_list, label):
            label_field = LabelField(label, skip_indexing=False)
            fields = {'tokens': ListField(sent_list), 'label': label_field}
            return Instance(fields)
and in my config file, I have:
    {
        dataset_reader: {
            type: 'my_reader',
        },
        train_data_path: 'data/train.json',
        validation_data_path: 'data/dev.json',
        data_loader: {
            batch_sampler: {
                type: 'bucket',
                batch_size: 10
            }
        },
I have tried setting the `lazy` parameter of the dataset reader to both `True` and `False`:
- When set to `True`, the model is able to train; however, I observe that only one train and one dev instance actually get loaded, although my dataset contains ~100.
- When set to `False`, I modified the `yield` line in `_read` to be a `return`; however, this causes a type error in the base vocabulary class. I also tried keeping the `yield` as is with `lazy` set to `False`; in this case, no instances get loaded at all, and since the set of instances is empty, the vocabulary does not get instantiated and the embedding class throws an error.
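For what it is worth, the `return` variant can be reproduced without AllenNLP. The sketch below uses plain Python stand-ins: `FakeInstance`, `read_with_yield`, and `read_with_return` are hypothetical names, not AllenNLP classes or methods. A `return` inside the loop exits on the first record and hands back a single object rather than an iterable of instances, which may be what the vocabulary code trips over. It also shows that a generator can be consumed only once, which is a general Python fact and not necessarily the cause of the empty-instances case.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class FakeInstance:
    """Stand-in for an AllenNLP Instance (assumption for illustration only)."""
    label: int


records = [{"label": 0}, {"label": 1}, {"label": 0}]


def read_with_yield(data: List[dict]) -> Iterator[FakeInstance]:
    # Generator: produces one instance per record, lazily.
    for record in data:
        yield FakeInstance(record["label"])


def read_with_return(data: List[dict]) -> Optional[FakeInstance]:
    # `return` inside the loop exits on the first record,
    # so the caller gets a single object, not an iterable.
    for record in data:
        return FakeInstance(record["label"])
    return None


lazy_result = read_with_yield(records)
all_instances = list(lazy_result)   # consumes the generator: 3 instances
second_pass = list(lazy_result)     # generator is already exhausted: empty

eager_result = read_with_return(records)  # a single FakeInstance

try:
    iter(eager_result)
    is_iterable = True
except TypeError:
    # Code expecting an iterable of instances fails on a single object.
    is_iterable = False
```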
I would appreciate any pointers or tips for debugging.