I am currently trying to parse a very large number of text documents using dask + spaCy. spaCy requires that I load a relatively large Language
object, and I would like to load it once per worker. I have a couple of mapping functions that I would like to apply to each document, and I would like to avoid reinitializing this object on every future / function call. What is the best way to handle this?
Example of what I'm talking about:
import pandas as pd

def text_fields_to_sentences(
    dataframe: pd.DataFrame,
    ...
) -> pd.DataFrame:
    # THIS IS WHAT I WOULD LIKE TO CHANGE:
    # the Language object is rebuilt on every partition / function call
    nlp, = setup_spacy(scispacy_version)

    def field_to_sentences(row):
        # Split one row's text field into one output row per sentence
        result = []
        doc = nlp(row[text_field])
        for sentence_tokens in doc.sents:
            # text_with_ws keeps each token's trailing whitespace
            # (token.string is deprecated and removed in spaCy 3)
            sentence_text = "".join(t.text_with_ws for t in sentence_tokens)
            r = row.copy()
            r[sentence_text_field] = sentence_text
            result.append(r)
        return result

    series = dataframe.apply(field_to_sentences, axis=1).explode()
    return pd.DataFrame(
        [s[new_col_order].values for s in series],
        columns=new_col_order,
    )

input_data.map_partitions(text_fields_to_sentences)
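
For reference, the closest I've come up with is stashing the object on the worker itself (a rough sketch assuming a dask.distributed cluster; `_get_nlp` and the `_spacy_nlp` attribute are names I invented here), but I'm not sure this is the intended pattern:

from dask.distributed import get_worker

def _get_nlp():
    # Build the spaCy Language object at most once per worker process
    # by caching it as an attribute on the Worker object.
    # get_worker() only works inside a task running on a distributed worker.
    worker = get_worker()
    if not hasattr(worker, "_spacy_nlp"):
        # setup_spacy / scispacy_version are my own helper and config from above
        worker._spacy_nlp, = setup_spacy(scispacy_version)
    return worker._spacy_nlp

With something like this, `text_fields_to_sentences` would call `_get_nlp()` instead of `setup_spacy(...)` directly, so the model is loaded once per worker rather than once per partition. Is this reasonable, or is there a better-supported way to do it?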