I'm trying to process data coming from BigQuery. I created a pipeline with Apache Beam as shown below:
import string

import apache_beam as beam
import unidecode
import fr_core_news_lg

nlp = fr_core_news_lg.load()

class CleanText(beam.DoFn):
    def process(self, row):
        # Strip accents, lowercase, replace punctuation with spaces, and collapse whitespace.
        row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))).split())
        yield row

class LemmaText(beam.DoFn):
    def process(self, row):
        # Lemmatize the cleaned text, disabling the spaCy components we don't need.
        doc = nlp(row['descriptioncleaned'], disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])
        row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
        yield row

with beam.Pipeline(runner="direct", options=options) as pipeline:
    soft = pipeline \
        | "GetRows" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygs") \
        | "CleanText" >> beam.ParDo(CleanText()) \
        | "LemmaText" >> beam.ParDo(LemmaText()) \
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq', custom_gcs_temp_location="gs://mygs", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")
Basically, it loads the data from my BigQuery table, cleans one of the columns (of type string), and lemmatizes it with spaCy's lemmatizer. I have approx. 8M rows, and each string is fairly long, approx. 300 words.
In the end it takes more than 15 hours to complete, and we have to run it every day.
I don't really understand why it takes so long on Dataflow, which is supposed to process the data in a parallelized way.
I have already used nlp.pipe from spaCy, but I can't really make it work with Apache Beam; what I attempted was roughly along the lines of the sketch below.
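The idea was to batch the rows with beam.BatchElements and call nlp.pipe on each batch. The step names, batch sizes, and loading the model in setup() are my own guesses rather than something I know to be correct, and I haven't managed to get a variant like this working properly.

class LemmaTextBatched(beam.DoFn):
    def setup(self):
        # Load the model once per worker instead of at module import time.
        self.nlp = fr_core_news_lg.load()

    def process(self, rows):
        # 'rows' is a list of elements produced by beam.BatchElements.
        texts = [row['descriptioncleaned'] for row in rows]
        docs = self.nlp.pipe(texts, disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])
        for row, doc in zip(rows, docs):
            row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
            yield row

# In the pipeline, replacing the single-row LemmaText step:
#     | "BatchRows" >> beam.BatchElements(min_batch_size=64, max_batch_size=256)
#     | "LemmaText" >> beam.ParDo(LemmaTextBatched())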
Is there a way to speed up the spaCy processing on Dataflow, or to parallelize it better?