
Given the available methods for specifying user-defined functions in PySpark:

  • Row-at-a-time native PySpark UDFs
  • Pandas UDFs that make use of Apache Arrow

How could one create a user-defined function that returns nothing, and run it on a DataFrame without having to create a new column?

Example: say you wanted to parallelize loading a DataFrame column into some external persistence store; that is, instead of writing the whole DataFrame to HDFS, use one field as a key and another as a value, transferring the data row by row into a blob store such as S3.

Jake Spracher

1 Answer


In such a case you wouldn't use a UDF at all. UDFs are unsuitable for this task for a number of reasons: a UDF has to return a value, that value becomes a new column, and Spark makes no guarantees about how many times a UDF is evaluated, so it shouldn't be used for side effects. Instead, just use foreach

foreach(f)

Applies the f function to each Row of this DataFrame.
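
A minimal sketch of the row-at-a-time approach, assuming boto3 as the blob-store client and a bucket named my-bucket (both are stand-ins here, not anything from the question):

    import boto3  # assumed dependency; any blob-store client works
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("k1", "v1"), ("k2", "v2")], ["key", "value"])

    def write_row(row):
        # The client is created inside the function because it is not
        # picklable and has to be constructed on the executor.
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket="my-bucket",              # hypothetical bucket name
            Key=row["key"],
            Body=row["value"].encode("utf-8"),
        )

    df.foreach(write_row)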

or foreachPartition

foreachPartition(f)

Applies the f function to each partition of this DataFrame.
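
For this use case foreachPartition is usually the better fit, because the client can be created once per partition and reused across all of its rows instead of once per row. A sketch under the same assumptions (boto3, hypothetical my-bucket, and the df from above):

    def write_partition(rows):
        import boto3                 # assumed dependency
        s3 = boto3.client("s3")      # one client per partition, reused
        for row in rows:
            s3.put_object(
                Bucket="my-bucket",  # hypothetical bucket name
                Key=row["key"],
                Body=row["value"].encode("utf-8"),
            )

    df.foreachPartition(write_partition)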

ijoseph
  Since this is PySpark, won't this come at the cost of serializing each row from the Java workers to the Python workers? Wouldn't it be better to do this via a Pandas UDF with Arrow, if possible, to avoid that overhead? – Jake Spracher Jan 17 '19 at 19:28