A. Background:
I have to do text manipulation in Python (concatenation, converting text to a spaCy doc, getting verbs from the spaCy doc, etc.) for 1 million records.
Each record takes about 1 second, so 1 million records would take roughly 11.5 days to process sequentially!
I am not using any ML model; it's just basic text manipulation in Python.
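To make the workload concrete, here is a minimal sketch of the kind of per-record work I mean (the function and argument names are illustrative, and it assumes the `en_core_web_sm` pipeline is installed):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm has been run
nlp = spacy.load("en_core_web_sm")

def process_record(text_a: str, text_b: str) -> str:
    combined = text_a + " " + text_b                       # concatenation
    doc = nlp(combined)                                    # convert to spaCy doc
    verbs = [t.lemma_ for t in doc if t.pos_ == "VERB"]    # get verbs from the doc
    return ", ".join(verbs)
```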
B. High-level problem:
Speed up the above run using the concurrent jobs that Databricks offers.
C. I have been recommended the steps below but am unsure how to proceed:
1. Create a table in Databricks for my input data (1 million rows x 5 columns).
2. Add 2 extra columns to the table: a Result column and a Status column (with values NaN/InProgress/Completed).
3. Split the work into 10 jobs such that records with Status=NaN are sent for processing (by the Python script), and the Status is updated to InProgress/Completed as the script works through each record.
4. Use a Spark DataFrame in the Python script instead of a pandas DataFrame (see the sketch after these steps).
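Putting steps 1-3 together, my understanding is the setup would look roughly like this in a Databricks notebook (a sketch only; `my_input_table` and `my_staged_table` are placeholder names, and `spark` is the session Databricks pre-creates):

```python
from pyspark.sql import functions as F

input_df = spark.table("my_input_table")  # 1 million rows x 5 columns

# Step 2: add the two bookkeeping columns
staged_df = (
    input_df
    .withColumn("Result", F.lit(None).cast("string"))
    .withColumn("Status", F.lit("NaN"))
)

# Persist as a Delta table so the 10 jobs can read and update Status
staged_df.write.format("delta").mode("overwrite").saveAsTable("my_staged_table")

# Step 3: each job picks up only the unprocessed records
pending_df = spark.table("my_staged_table").where(F.col("Status") == "NaN")
```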
D. What I have tried already:
I have simply changed my Python code from pandas to pyspark.pandas (the pandas API on Spark, which is supposed to behave much like a Spark DataFrame).
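Concretely, the swap looked roughly like this (a sketch; the path, column names, and the `get_verbs` helper are placeholders for my actual code):

```python
import pyspark.pandas as ps

# Before: import pandas as pd; df = pd.read_csv(...)
df = ps.read_csv("dbfs:/path/to/input.csv")  # placeholder path

# The rest of the code stays pandas-like
df["combined"] = df["col_a"] + " " + df["col_b"]
df["verbs"] = df["combined"].apply(get_verbs)  # get_verbs wraps the spaCy logic above
```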
But with the above, I have not achieved much improvement: my code ran about 30% faster for 300 records, but I think that is just because of a better Databricks (Azure cloud) processor.
For larger record counts, my Databricks notebook throws a PicklingError, which I have described in detail in the question.