
I am trying to use the AllenNLP library to perform NER. The library works perfectly fine with CoNLL-2003 and other datasets that only have entities and tokens (I had to update the `_read` function for this). But it raises `ValueError: not enough values to unpack (expected 2, got 1)` when I try to use my own dataset. I have compared the formatting, special characters, spacing, and even file names but couldn't find any issue. This is a sample from a dataset that works:

O   show
O   me
O   films
O   with
B-ACTOR drew
I-ACTOR barrymore
O   from
O   the
B-YEAR  1980s

O   what
O   movies
O   starred
O   both
B-ACTOR al
I-ACTOR pacino

This is a sample from my dataset which is not working:

O   dated
O   as
O   of
B-STARTDATE February
I-STARTDATE 9
I-STARTDATE ,
L-STARTDATE 2017
O   by
O   and
O   between
O   Allenware
O   Ltd

I am not able to identify the issue; please help.
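Since I couldn't spot the difference by eye, here is a quick sanity check I wrote (a hypothetical helper, not part of AllenNLP): every non-blank line should split into exactly two whitespace-separated fields (tag, token), and any line that doesn't is a candidate for the error.

```python
# Hypothetical helper: report lines that do not split into exactly two
# whitespace-separated fields (tag, token). Blank divider lines are skipped.
def find_malformed(path):
    bad = []
    with open(path, "r") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.split()
            if parts and len(parts) != 2:
                bad.append((lineno, line.rstrip("\n")))
    return bad
```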

Update

Adding stderr.log as requested.

0it [00:00, ?it/s]
1it [00:00, 556.72it/s]

0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/allennlp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/allennlp/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/allennlp/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/allennlp/lib/python3.6/site-packages/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "/allennlp/lib/python3.6/site-packages/allennlp/commands/train.py", line 168, in train_model_from_file
    cache_directory, cache_prefix)
  File "/allennlp/lib/python3.6/site-packages/allennlp/commands/train.py", line 226, in train_model
    cache_prefix)
  File "/allennlp/lib/python3.6/site-packages/allennlp/training/trainer_pieces.py", line 42, in from_params
    all_datasets = training_util.datasets_from_params(params, cache_directory, cache_prefix)
  File "/allennlp/lib/python3.6/site-packages/allennlp/training/util.py", line 185, in datasets_from_params
    validation_data = validation_and_test_dataset_reader.read(validation_data_path)
  File "/allennlp/lib/python3.6/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 134, in read
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/allennlp/lib/python3.6/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 134, in <listcomp>
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/allennlp/lib/python3.6/site-packages/tqdm/std.py", line 1081, in __iter__
    for obj in iterable:
  File "/allennlp/lib/python3.6/site-packages/allennlp/data/dataset_readers/conll2003.py", line 119, in _read
    ner_tags,tokens_ = fields
ValueError: not enough values to unpack (expected 2, got 1)
0it [00:00, ?it/s]

Adding the `_read` and `text_to_instance` functions:

    @overrides
    def _read(self, file_path: str) -> Iterable[Instance]:
        # if `file_path` is a URL, redirect to the cache
        file_path = cached_path(file_path)

        with open(file_path, "r") as data_file:
            logger.info("Reading instances from lines in file at: %s", file_path)

            # Group into alternative divider / sentence chunks.
            for is_divider, lines in itertools.groupby(data_file, _is_divider):
                # Ignore the divider chunks, so that `lines` corresponds to the words
                # of a single sentence.
                if not is_divider:
                    fields = [line.strip().split() for line in lines]
                    # unzipping trick returns tuples, but our Fields need lists
                    fields = [list(field) for field in zip(*fields)]
                    ner_tags, tokens_ = fields
                    # TextField requires ``Token`` objects
                    tokens = [Token(token) for token in tokens_]

                    yield self.text_to_instance(tokens, ner_tags)

    def text_to_instance(  # type: ignore
        self,
        tokens: List[Token],
        ner_tags: List[str] = None,
    ) -> Instance:
        """
        We take `pre-tokenized` input here, because we don't have a tokenizer in this class.
        """

        sequence = TextField(tokens, self._token_indexers)
        instance_fields: Dict[str, Field] = {"tokens": sequence}
        instance_fields["metadata"] = MetadataField({"words": [x.text for x in tokens]})
        coded_ner = ner_tags
        if 'ner' in self.feature_labels:
            if coded_ner is None:
                raise ConfigurationError("Dataset reader was specified to use NER tags as "
                                         "features. Pass them to text_to_instance.")
            instance_fields['ner_tags'] = SequenceLabelField(coded_ner, sequence, "ner_tags")
        if self.tag_label == 'ner' and coded_ner is not None:
            instance_fields['tags'] = SequenceLabelField(coded_ner, sequence, self.label_namespace)
        return Instance(instance_fields)
  • Please [edit] the question to provide the full traceback. If you can, please also indicate which input exactly triggered it. (Maybe you can trim down your sample data to a single line of input to produce a proper [mre]?) – tripleee Nov 01 '19 at 13:09
  • Does the data contain tabs or sequences of literal spaces? – tripleee Nov 01 '19 at 13:11
  • I tried with only one record but it gave the same error. I also tried with both tabs and spaces; neither worked. – sohel shaikh Nov 01 '19 at 13:16
  • If you changed the `_read` function, show us how you changed it. I notice that it provides additional logging so you should be able to get more information by increasing the log level. I'm looking at https://github.com/allenai/allennlp/blob/35b285585e0677b1025eac1c19b5eefe7e2a70db/allennlp/data/dataset_readers/conll2003.py#L119 – tripleee Nov 01 '19 at 13:26
  • I have updated the question and added both functions. – sohel shaikh Nov 01 '19 at 13:39

0 Answers