
A. Background:

  1. I have to do text manipulation in Python (e.g. concatenation, converting to a spaCy Doc, getting the verbs from the spaCy Doc, etc.) for 1 million records (a sketch of this per-record processing appears after this list).

  2. Each record takes about 1 second, so 1 million records would take roughly 11-12 days!

  3. I am not using any ML model; it's just basic text manipulation in Python.
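
For context, here is a minimal sketch of the kind of per-record processing described above. The exact operations and column contents are my assumptions, not the original code:

```python
import spacy

# Load the model once, not once per record (assumption: small English model).
nlp = spacy.load("en_core_web_sm")

def process_record(text_a: str, text_b: str) -> str:
    """Concatenate two text fields, parse with spaCy, and return the verbs."""
    combined = f"{text_a} {text_b}"      # concatenation
    doc = nlp(combined)                  # convert to a spaCy Doc
    verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]  # get verbs
    return " ".join(verbs)
```

Note that even single-machine spaCy code is usually much faster when texts are fed through `nlp.pipe(...)` in batches instead of one `nlp(...)` call per record.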

B. High-level problem:

Speed up the above run using the concurrent jobs that Databricks offers.

C. I have been recommended the steps below but am unsure how to proceed:

  1. Create a table in Databricks for the input data (1 million rows x 5 columns).

  2. Add 2 additional columns to the table: a Result column and a Status column (with values NaN/InProgress/Completed).

  3. Split the table across 10 jobs such that records with Status = NaN are sent for processing (the Python script), and Status is updated to InProgress/Completed as the script works through each record (see the sketch after this list).

  4. Use a Spark DataFrame in the Python script instead of a pandas DataFrame.
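
A minimal sketch of steps 1-3 as I understand them (table and column names are my assumptions; `spark` is the session Databricks notebooks provide). A Delta table is used because plain Spark DataFrames cannot be updated row by row:

```python
from pyspark.sql import functions as F

# Steps 1-2: create the working table with the two extra columns.
df = (spark.read.table("input_records")                     # 1M rows x 5 cols
        .withColumn("Result", F.lit(None).cast("string"))   # to be filled in
        .withColumn("Status", F.lit(None).cast("string")))  # NULL = not started
df.write.format("delta").mode("overwrite").saveAsTable("records_with_status")

# Step 3: each of the 10 jobs selects only the unprocessed rows ...
pending = spark.table("records_with_status").where(F.col("Status").isNull())

# ... and marks them as it goes, via Delta's row-level UPDATE support.
spark.sql("UPDATE records_with_status SET Status = 'InProgress' "
          "WHERE Status IS NULL")
```

One design note: if the per-record work is expressed as a Spark UDF (see the sketch in section D), Spark already distributes it across the cluster's cores, so the manual 10-way split mainly buys restartability rather than extra parallelism.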

D. What I have tried already:

  1. I simply changed my Python code from pandas to pyspark.pandas (the pandas API on Spark, which is supposed to behave similarly to a Spark DataFrame).

  2. But with the above, I have not achieved much improvement: my code ran 30% faster for 300 records, and I think that's just because of a faster Databricks machine (Azure cloud).

  3. For larger batches of records, my Databricks notebook raises a PicklingError, which I have described in detail in the question (a sketch that avoids one common cause follows below).
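
For reference, the usual way to run this kind of Python/spaCy logic over a Spark DataFrame is a pandas UDF. A frequent cause of PicklingErrors in this setup is capturing the loaded spaCy model in the function's closure, so the sketch below (column names are again my assumptions) loads the model inside the UDF instead, on each executor:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def extract_verbs(texts: pd.Series) -> pd.Series:
    # Import and load inside the UDF so the model is created on the executor
    # rather than pickled from the driver (a common PicklingError source).
    import spacy
    nlp = spacy.load("en_core_web_sm")   # assumption: small English model
    docs = nlp.pipe(texts.tolist())      # batched parsing is faster
    return pd.Series(
        [" ".join(t.lemma_ for t in doc if t.pos_ == "VERB") for doc in docs]
    )

# Usage (column name "text" is an assumption):
result = (spark.table("records_with_status")
            .withColumn("Result", extract_verbs(F.col("text"))))
```

Loading the model per batch is still wasteful; a common refinement is to cache `nlp` in a module-level global inside the UDF so each executor loads it only once.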
