I'm trying to process data coming from BigQuery. I created a pipeline with Apache Beam as shown below:
import string

import apache_beam as beam
import unidecode
import fr_core_news_lg

nlp = fr_core_news_lg.load()

class CleanText(beam.DoFn):
    def process(self, row):
        # Strip accents, lowercase, replace punctuation with spaces, and collapse whitespace.
        row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))).split())
        yield row

class LemmaText(beam.DoFn):
    def process(self, row):
        # Lemmatize the cleaned text, disabling the spaCy components we don't need.
        doc = nlp(row['descriptioncleaned'], disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])
        row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
        yield row

with beam.Pipeline(runner="direct", options=options) as pipeline:
    soft = pipeline \
        | "GetRows" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygs") \
        | "CleanText" >> beam.ParDo(CleanText()) \
        | "LemmaText" >> beam.ParDo(LemmaText()) \
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq', custom_gcs_temp_location="gs://mygs", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")
Basically, it loads the data from my BigQuery table, cleans one of the columns (of type string), and lemmatizes it with spaCy's lemmatizer. I have approx. 8M rows, and each string is fairly long, approx. 300 words.
In the end it takes more than 15 hours to complete, and we have to run it every day.
I don't really understand why it takes so long on Dataflow, which is supposed to process the data in a parallelized way.
I have already used nlp.pipe from spaCy, but I can't really make it work with Apache Beam; what I attempted was roughly along the lines of the sketch below.
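The idea was to batch the rows with beam.BatchElements and call nlp.pipe on each batch. The step names, batch sizes, and loading the model in setup() are my own guesses rather than something I know to be correct, and I haven't managed to get a variant like this working properly.

class LemmaTextBatched(beam.DoFn):
    def setup(self):
        # Load the model once per worker instead of at module import time.
        self.nlp = fr_core_news_lg.load()

    def process(self, rows):
        # 'rows' is a list of elements produced by beam.BatchElements.
        texts = [row['descriptioncleaned'] for row in rows]
        docs = self.nlp.pipe(texts, disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])
        for row, doc in zip(rows, docs):
            row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
            yield row

# In the pipeline, replacing the single-row LemmaText step:
#     | "BatchRows" >> beam.BatchElements(min_batch_size=64, max_batch_size=256)
#     | "LemmaText" >> beam.ParDo(LemmaTextBatched())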
Is there a way to speed up the spaCy processing on Dataflow, or to parallelize it better?