
I want to resolve coreferences without Internet access using AllenNLP and the coref-spanbert-large model. I am trying to do it the way described here: https://demo.allennlp.org/coreference-resolution

My code:

from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging

predictor = Predictor.from_path(r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz")
example = 'Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen.Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates, two years younger, with whom he shared an enthusiasm for computers.'
pred = predictor.predict(document=example)
coref_res = predictor.coref_resolved(example)
print(pred)
print(coref_res)

When I have Internet access, the code works correctly. But when I don't have Internet access, I get the following errors:

Traceback (most recent call last):
  File "C:/Users/aap/Desktop/CoreNLP/Coref_AllenNLP.py", line 14, in <module>
    predictor = Predictor.from_path(r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz")
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\predictors\predictor.py", line 361, in from_path
    load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\models\archival.py", line 206, in load_archive
    config.duplicate(), serialization_dir
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\models\archival.py", line 232, in _load_dataset_readers
    dataset_reader_params, serialization_dir=serialization_dir
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 604, in from_params
    **extras,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 632, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 200, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 307, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 391, in construct_arg
    **extras,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 341, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 604, in from_params
    **extras,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 634, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\token_indexers\pretrained_transformer_mismatched_indexer.py", line 63, in __init__
    **kwargs,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\token_indexers\pretrained_transformer_indexer.py", line 58, in __init__
    model_name, tokenizer_kwargs=tokenizer_kwargs
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\tokenizers\pretrained_transformer_tokenizer.py", line 71, in __init__
    model_name, add_special_tokens=False, **tokenizer_kwargs
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\cached_transformers.py", line 110, in get_tokenizer
    **kwargs,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 362, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\models\auto\configuration_auto.py", line 368, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\configuration_utils.py", line 424, in get_config_dict
    use_auth_token=use_auth_token,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\file_utils.py", line 1087, in cached_path
    local_files_only=local_files_only,
  File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\file_utils.py", line 1268, in get_from_cache
    "Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Process finished with exit code 1

Please tell me, what do I need to do so that my code works without Internet access?

Daisy

2 Answers


You will need a local copy of the transformer model's configuration file and vocabulary so that the tokenizer and token indexer don't need to download them:

from transformers import AutoConfig, AutoTokenizer

# e.g. transformer_model_name = "SpanBERT/spanbert-large-cased"
# and local_config_path = a local directory to save the files into
tokenizer = AutoTokenizer.from_pretrained(transformer_model_name)
config = AutoConfig.from_pretrained(transformer_model_name)
tokenizer.save_pretrained(local_config_path)
config.to_json_file(local_config_path + "/config.json")

You will then need to override the transformer model name in the configuration file to the local directory (local_config_path) where you saved these things:

predictor = Predictor.from_path(
    r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz",
    overrides={
        "dataset_reader.token_indexers.tokens.model_name": local_config_path,
        "validation_dataset_reader.token_indexers.tokens.model_name": local_config_path,
        "model.text_field_embedder.tokens.model_name": local_config_path,
    },
)
petew
  • Thank you, @petew! What do I need to write instead of 'transformer_model_name'? Do I need one more model? – Daisy May 17 '21 at 14:27
  • I believe the model you're using uses the "SpanBERT/spanbert-large-cased" pretrained model: https://huggingface.co/SpanBERT/spanbert-large-cased – petew May 17 '21 at 17:13
  • Thanks a lot for the response. Sorry, I don't understand why it doesn't work. I downloaded a model with four files and wrote a path to it instead of 'transformer_model_name'. Here is my code now (besides imports): `transformer_model_name = r"C:\Users\Анна\Desktop\spanbert-large-cased" local_config_path = r"C:\Users\Анна\Desktop\spanbert-large-cased\span" tokenizer = AutoTokenizer.from_pretrained(transformer_model_name) config = AutoConfig.from_pretrained(transformer_model_name) tokenizer.save_pretrained(local_config_path) config.to_json_file(local_config_path + "/config.json")` – Daisy May 17 '21 at 19:59
  • `predictor = Predictor.from_path( r"C:\Users\Анна\Desktop\Работа\CoreNLP\coref-spanbert-large-2021.03.10.tar.gz", overrides={ "dataset_reader.token_indexers.tokens.model_name": local_config_path, "validation_dataset_reader.token_indexers.tokens.model_name": local_config_path, "model.text_field_embedder.tokens.model_name": local_config_path, }, )` The rest is the same as it was before. Finally, I receive the same errors: – Daisy May 17 '21 at 20:07
  • `Traceback (most recent call last): File "C:/Users/Анна/Desktop/Работа/CoreNLP/coref_allen.py", line 19, in "model.text_field_embedder.tokens.model_name": local_config_path, File "C:\Users\Анна\Desktop\Работа\CoreNLP\corenlp\lib\site-packages\allennlp\predictors\predictor.py", line 361, in from_path ` etc. So what am I doing wrong? Why doesn't it work? – Daisy May 17 '21 at 20:09
  • Can you tell which file `transformers` is trying to download when it fails? – petew May 17 '21 at 22:29
  • I don't know exactly. Maybe some file from the folder `... cache\transformers`. In this directory there are several files with long names which were downloaded earlier with Internet access. When I tried to put the files from spanbert-large-cased there, it didn't help. Maybe `transformers` tries to download some other files. Where can I see that exactly? – Daisy May 18 '21 at 06:12
  • The errors begin like this: `C:\Users\aap/.cache\huggingface\transformers C:\Users\aap/.cache\huggingface\transformers\1a1dfe6956710e7344f6fc7595b16b878615c5f6f2b91e9699f6c8787af0d2fb Traceback (most recent call last): File "C:/Users/aap/Desktop/CoreNLP/Coref_AllenNLP.py", line 28, in "model.text_field_embedder.tokens.model_name": local_config_path, File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\predictors\predictor.py", line 361, in from_path load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),` – Daisy May 18 '21 at 08:18
  • It's difficult for me to tell what's going on here since I can't see the full stack trace. Would you mind posting this question with the full stack trace on GitHub Discussions instead? https://github.com/allenai/allennlp/discussions – petew May 19 '21 at 17:08
  • Yes, I did it https://github.com/allenai/allennlp/discussions/5215 – Daisy May 20 '21 at 06:56
  • Is there a solution to it? – user6083088 Sep 01 '22 at 19:51

I ran into a similar problem when using structured-prediction-srl-bert without Internet access, and I saw four items being downloaded in the logs:

  1. dataset_reader.bert_model_name = bert-base-uncased, Downloading 4 files
  2. model INFO vocabulary.py - Loading token dictionary from data/structured-prediction-srl-bert.2020.12.15/vocabulary. Downloading... 4x smaller files
  3. Spacy models 'en_core_web_sm' not found
  4. Later on, [nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary failure in name resolution> [nltk_data] Error loading wordnet: <urlopen error [Errno -3] Temporary failure in name resolution>

I solved it with these steps:

  1. structured-prediction-srl-bert:

pip install allennlp==2.10.0 allennlp-models==2.10.0

from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("./data/structured-prediction-srl-bert.2020.12.15/")
  2. bert-base-uncased

Additionally, I had to change the "bert_model_name" from "bert-base-uncased" into a path "./data/bert-base-uncased"; the former causes the download. This has to be done in ./data/structured-prediction-srl-bert.2020.12.15/config.json, and there are two occurrences.
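Rather than editing the two occurrences by hand, a short script can patch every "bert_model_name" value in the extracted config. This is a sketch assuming the config file is plain JSON (the function name and key path are just illustrative; verify your archive's config is JSON, not Jsonnet, before using it):

```python
import json

def point_bert_to_local(config_path, local_model_path):
    """Replace every 'bert_model_name' value in a JSON config
    with a local path so transformers never hits the network."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    def patch(node):
        # Walk the nested config and rewrite the key wherever it occurs.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "bert_model_name":
                    node[key] = local_model_path
                else:
                    patch(value)
        elif isinstance(node, list):
            for item in node:
                patch(item)

    patch(config)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

This catches both occurrences at once, and any others a different archive might contain.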

  3. python -m spacy download en_core_web_sm
  4. python -c 'import nltk; nltk.download("punkt"); nltk.download("wordnet")'

After these steps, AllenNLP did not need the Internet anymore.

MiroJanosik