
Given the available methods for specifying user-defined functions in PySpark:

  • Row-at-a-time native PySpark UDFs
  • Pandas UDFs that make use of Apache Arrow

How could one create a user-defined function that returns nothing, and run it on a DataFrame without having to create a new column?

Example: say you wanted to parallelize loading a DataFrame column into some external persistence store; that is, instead of writing the whole DataFrame to HDFS, use one field as a key and another as a value, transferring the data row by row into a blob store such as S3.

Jake Spracher

1 Answer


In such a case you wouldn't use a UDF at all. UDFs are unsuitable for this task for a number of reasons: a UDF has to return a value, that value becomes a new column, and Spark makes no guarantees about how many times a UDF is evaluated, so it shouldn't be used for side effects. Instead, just use foreach

foreach(f)

Applies the f function to each Row of this DataFrame.
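
A minimal sketch of the row-at-a-time approach, assuming boto3 as the blob-store client and a bucket named my-bucket (both are stand-ins here, not anything from the question):

    import boto3  # assumed dependency; any blob-store client works
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("k1", "v1"), ("k2", "v2")], ["key", "value"])

    def write_row(row):
        # The client is created inside the function because it is not
        # picklable and has to be constructed on the executor.
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket="my-bucket",              # hypothetical bucket name
            Key=row["key"],
            Body=row["value"].encode("utf-8"),
        )

    df.foreach(write_row)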

or foreachPartition

foreachPartition(f)

Applies the f function to each partition of this DataFrame.
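
For this use case foreachPartition is usually the better fit, because the client can be created once per partition and reused across all of its rows instead of once per row. A sketch under the same assumptions (boto3, hypothetical my-bucket, and the df from above):

    def write_partition(rows):
        import boto3                 # assumed dependency
        s3 = boto3.client("s3")      # one client per partition, reused
        for row in rows:
            s3.put_object(
                Bucket="my-bucket",  # hypothetical bucket name
                Key=row["key"],
                Body=row["value"].encode("utf-8"),
            )

    df.foreachPartition(write_partition)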

ijoseph
  Since this is PySpark, won't this come at the cost of serializing each row from the Java workers to the Python workers? Wouldn't it be better to do this via a Pandas UDF with Arrow, if possible, to avoid that overhead? – Jake Spracher Jan 17 '19 at 19:28