I'm new to PySpark and Pandas UDFs. I'm running the following pandas UDF to jumble the characters of a string column (for example, the input 'Luke' might become 'ulek'):
import random
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def jumble_string(column: pd.Series) -> pd.Series:
    # Shuffle each non-null string's characters, then lowercase the result.
    return column.apply(lambda x: None if x is None else "".join(random.sample(x, len(x))).lower())

spark_df = spark_df.withColumn("names", jumble_string("names"))
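For completeness, here's a minimal setup that reproduces the behaviour (the toy data below just stands in for my real dataset):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real dataset.
spark_df = spark.createDataFrame([("Luke",), ("Leia",), (None,)], ["names"])
spark_df.withColumn("names", jumble_string("names")).show()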
When I run this function on a large dataset, execution takes unusually long. I'm guessing the .apply call has something to do with it.
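One direction I've been wondering about is dropping the pandas UDF entirely in favour of built-in column functions, so the shuffling happens inside the JVM without serializing batches out to a Python worker. A rough sketch of what I mean (untested on my real data; F.split with an empty pattern yields one array element per character, possibly plus a trailing empty string depending on the Spark version, which concat_ws joins away harmlessly):

from pyspark.sql import functions as F

# Split into characters, shuffle the array natively, re-join, lowercase.
shuffled = F.lower(F.concat_ws("", F.shuffle(F.split(F.col("names"), ""))))

# concat_ws turns a null array into an empty string, so restore nulls
# explicitly to match the UDF's behaviour.
spark_df = spark_df.withColumn(
    "names",
    F.when(F.col("names").isNull(), None).otherwise(shuffled),
)

I haven't benchmarked this against the UDF version yet.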
Is there any way I can rewrite this function so it executes efficiently on a big dataset? Please advise.