
I am trying to make my code run on a Raspberry Pi 4 and have been stuck on this error for hours. This code segment throws an error on the Pi but runs perfectly on Windows within the same project:

def create_lem_texts(data):  # data comes in as a list
    def sent_to_words(sentences):
        for sentence in sentences:
            # deacc=True removes punctuation
            yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

    data_words = list(sent_to_words(data))
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold, fewer phrases
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    def remove_stopwords(texts):
        return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

    def make_bigrams(texts):
        return [bigram_mod[doc] for doc in texts]

    def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        """https://spacy.io/api/annotation"""
        texts_out = []
        print(os.getcwd())
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        return texts_out

    data_words_nostops = remove_stopwords(data_words)
    data_words_bigrams = make_bigrams(data_words_nostops)
    print(os.getcwd())
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

    # Do lemmatization, keeping only noun, adj, verb, adv
    data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    return data_lemmatized

This code is in turn called by this function:

def assign_topics_tweet(tweets):
    owd = os.getcwd()
    print(owd)
    os.chdir('/home/pi/Documents/pycharm_project_twitter/topic_model/')
    print(os.getcwd())
    lda = LdaModel.load("LDA26")
    print(lda)
    id2word = Dictionary.load('Id2Word')
    print(id2word)
    os.chdir(owd)
    data = create_lem_texts(tweets)
    corpus = [id2word.doc2bow(text) for text in data]
    topics = []
    for tweet in corpus:
        topics_dist = lda.get_document_topics(tweet)
        topics.append(topics_dist)
    return topics

And here is the error message:

Traceback (most recent call last):
  File "/home/pi/Documents/pycharm_project_twitter/Twitter_Import.py", line 193, in <module>
    main()
  File "/home/pi/Documents/pycharm_project_twitter/Twitter_Import.py", line 169, in main
    topics = assign_topics_tweet(data)
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 238, in assign_topics_tweet
    data = create_lem_texts(tweets)
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 76, in create_lem_texts
    data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 67, in lemmatization
    texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 67, in <listcomp>
    texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  File "token.pyx", line 871, in spacy.tokens.token.Token.lemma_.__get__
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744073541552667'. This usually refers to an issue with the `Vocab` or `StringStore`."

Process finished with exit code 1

I tried reinstalling spaCy and the en model, and running the script directly on the Pi; the spaCy versions are the same on my Windows machine and on the Pi. There is basically no information online about this error.
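For reference, version agreement between the two machines can be checked roughly like this (a minimal sketch, not part of my original script; spacy.__version__, the loaded model's meta['version'], and python -m spacy validate are standard spaCy tooling):

import spacy

print(spacy.__version__)                  # library version
nlp_check = spacy.load('en_core_web_sm')
print(nlp_check.meta.get('version'))      # installed model version
# From a shell, model/library compatibility can also be checked with:
#   python -m spacy validate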

1 Answer


After three days of testing, the problem was solved by simply installing an older version of spaCy, 2.0.1.
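For anyone reproducing this, the downgrade looks roughly like the following (a sketch assuming pip and the small English model; adjust the model name to whatever your project loads):

pip install spacy==2.0.1
python -m spacy download en_core_web_sm   # re-download the model against the older release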

  • May I ask which versions of spacy you tried before? Was 2.0.1 the latest version that worked for you? I get exactly the same error on a BananaPi M1 with Armbian while on Windows the code runs fine. I am using spacy 2.3.5 with en_core_web_md 2.3.1. – S818 Jan 31 '21 at 17:39
  • 2.0.1 didn't work for me. I got a gcc error "spacy/parts_of_speech.cpp: No such file or directory" like so: https://github.com/explosion/spaCy/issues/1992 In the end, spacy 2.0.7 worked for me. – S818 Jan 31 '21 at 20:45