
I'm new to PySpark and Pandas UDFs. I'm running the following Pandas UDF function to jumble a column containing strings (for example, an input 'Luke' might result in 'ulek'):

pandas_udf("string")
def jumble_string(column: pd.Series)-> pd.Series:
  return column.apply(lambda x: None if x==None else ''.join(random.sample(x, len(x))).lower()) 

spark_df = spark_df.withColumn("names", jumble_string("names"))

On running the above function on a large dataset I've noticed that the execution takes unusually long.

I'm guessing the .apply function has something to do with this issue.

Is there any way I can rewrite this function so it executes efficiently on a big dataset? Please advise.

The Singularity

2 Answers


Since the .apply method is not vectorized, the operation is performed by looping over the elements one at a time, which slows execution down as the data size grows.

For small data the time difference is usually negligible, but as the size increases it becomes noticeable. Since we are likely to be dealing with vast amounts of data, execution time should always be taken into consideration.

You can read more about Apply vs Vectorized Operations here.
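To illustrate the general apply-vs-vectorized gap (not the jumbling itself, which shuffles each string independently and so cannot be expressed as a single vectorized call), here is a toy comparison on a numeric Series; the data and sizes are made up purely for illustration:

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

# row by row: pandas calls the Python lambda once per element
looped = s.apply(lambda x: x * 2 + 1)

# vectorized: the whole underlying array is processed in optimized C code
vectorized = s * 2 + 1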

Therefore I decided to use a list comprehension instead, which improved my performance marginally.

@pandas_udf("string")
def jumble_string(column: pd.Series)-> pd.Series:
  return pd.Series([None if x==None else ''.join(random.sample(x, len(x))).lower() for x in column])
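Usage is the same as in the question:

spark_df = spark_df.withColumn("names", jumble_string("names"))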

The Singularity

There is no built-in Spark function that jumbles the strings in a column, so we are forced to resort to UDFs or Pandas UDFs.
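For reference, a plain row-at-a-time Python UDF doing the same jumbling might look roughly like the sketch below (jumble_string_plain is just an illustrative name); Pandas UDFs are generally preferred because they work on batches via Arrow rather than calling Python once per row:

import random
from pyspark.sql.functions import udf

@udf("string")
def jumble_string_plain(x):
  # called once per row; x is a plain Python string (or None)
  if x is None:
    return None
  return ''.join(random.sample(x, len(x))).lower()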

Your solution is actually quite nice; perhaps we can improve it by removing the .apply method from the series and using only base Python on a string for each row.

@pandas_udf("string")
def jumble_string_new(column: pd.Series)-> pd.Series:
  x = column.iloc[0]   # each pd.Series is made only of one element
  if x is None:
    return pd.Series([None])
  else:
    return pd.Series([''.join(random.sample(x, len(x))).lower()])

The results are the same as with your function; however, I've not been able to test it on a very large dataframe. Try it yourself and see whether it's computationally more efficient.

Ric S
  • I'll definitely try this and get back to you – The Singularity Aug 30 '21 at 12:44
  • I tried running it on a dataset with 500 rows using `spark_df.withColumn("first_name", shuffle_string("first_name"))` and got the error `PythonException: 'RuntimeError: Result vector from pandas_udf was not the required length: expected 500, got 1'` – The Singularity Aug 30 '21 at 12:51
  • You are getting this error because this kind of Pandas UDF (SCALAR type) requires a pd.Series as output, not a single string. Btw, what is `shuffle_string`? Where is it defined? – Ric S Aug 30 '21 at 13:15
  • My bad, `shuffle_string` and `jumble_string` are the same – The Singularity Aug 30 '21 at 13:16
  • I tried my proposed `jumble_string_new` on a very small dataset and it worked smoothly. Are you sure that the input and output of the Pandas UDF are actually pd.Series? – Ric S Aug 30 '21 at 13:20
  • Yup! I executed it on the same dataframe as `jumble_string`. Is `x = column.iloc[0]` causing the error? – The Singularity Aug 30 '21 at 13:24
  • `.iloc[0]` should take the first element of the Series. Since the Pandas UDF is applied to every row, it will take the element of that row – Ric S Aug 30 '21 at 14:00
  • So your code is only designed to jumble the first row value in column? As opposed to the whole column? – The Singularity Aug 30 '21 at 14:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236566/discussion-between-ric-s-and-luke). – Ric S Aug 30 '21 at 14:12