
I am currently trying to parse a very large number of text documents using dask + spaCy. spaCy requires that I load a relatively large Language object, and I would like to load it only once per worker. I have a couple of mapping functions that I would like to apply to each document, and I would ideally avoid reinitializing this object for each future / function call. What is the best way to handle this?

Example of what I'm talking about:

import pandas as pd

def text_fields_to_sentences(
    dataframe: pd.DataFrame,
    ...
) -> pd.DataFrame:
    # THIS IS WHAT I WOULD LIKE TO CHANGE
    nlp, = setup_spacy(scispacy_version)

    def field_to_sentences(row):
        # Split one row's text field into one copy of the row per sentence.
        result = []
        doc = nlp(row[text_field])
        for sentence_tokens in doc.sents:
            sentence_text = "".join([t.string for t in sentence_tokens])
            r = row.copy()
            r[sentence_text_field] = sentence_text
            result.append(r)
        return result

    series = dataframe.apply(
        field_to_sentences,
        axis=1
    ).explode()
    return pd.DataFrame(
        [s[new_col_order].values for s in series],
        columns=new_col_order
    )

input_data.map_partitions(text_fields_to_sentences)
JSybrandt

1 Answer


You could create the object as a delayed object:

corpus = dask.delayed(make_corpus)("english")

And then use this lazy value in place of the full value:

df = df.text.apply(parse, corpus=corpus)
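
The snippet above references two helpers that the answer doesn't show. As a rough sketch of what they might look like (using spacy.load in place of the question's setup_spacy, with an illustrative parse that returns the sentence strings of a single text):

import spacy

def make_corpus(language):
    # Expensive, one-time load of the spaCy Language object; the model name
    # here is illustrative and ignores the `language` argument.
    return spacy.load("en_core_web_sm")

def parse(text, corpus):
    # `corpus` is a concrete Language object by the time this runs on a worker.
    return [sent.text for sent in corpus(text).sents]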

Dask will call make_corpus once on one machine and then pass the result around to the workers as needed. It will not recompute that task.
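
Applied to the question's map_partitions workflow, a minimal sketch of the same idea might look like the following. The column names and the meta argument are assumptions, text_fields_to_sentences is reworked here to take the model as a second argument, and the delayed corpus is passed to map_partitions so that each worker receives the already-computed Language object:

import dask
import pandas as pd

def text_fields_to_sentences(partition, nlp):
    # `nlp` is the computed Language object, not the Delayed wrapper.
    rows = []
    for _, row in partition.iterrows():
        for sent in nlp(row["text"]).sents:
            r = row.copy()
            r["sentence_text"] = sent.text
            rows.append(r)
    return pd.DataFrame(rows, columns=list(partition.columns) + ["sentence_text"])

corpus = dask.delayed(make_corpus)("english")

# meta must describe the real output columns; this assumes the input
# dataframe only carries a "text" column.
output = input_data.map_partitions(
    text_fields_to_sentences,
    corpus,
    meta={"text": str, "sentence_text": str},
)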

MRocklin