Given the available methods for specifying user-defined functions in PySpark:
- Row-at-a-time native PySpark UDFs
- Pandas UDFs that make use of Apache Arrow
How could one create a user-defined function that returns nothing, and run it over a DataFrame, without having to create a new column?
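As I understand them, both styles are built around producing a return value that then becomes a column. A minimal sketch of each (the toy uppercase function and column name are just illustrative):

```python
import pandas as pd
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType

# Row-at-a-time native UDF: the Python function is called once per row value.
@udf(returnType=StringType())
def plain_upper(value):
    return value.upper()

# Pandas UDF: called once per Arrow batch, operating on a pandas Series.
@pandas_udf(StringType())
def vectorized_upper(values: pd.Series) -> pd.Series:
    return values.str.upper()

# Either way, the result has to be attached as a column, e.g.:
# df.withColumn("upper", plain_upper("some_col"))
# df.withColumn("upper", vectorized_upper("some_col"))
```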
Example: say you wanted to parallelize loading a DataFrame column into some external persistence store. That is, instead of writing the whole DataFrame to HDFS, use one field as a key and another as a value, and transfer the rows one by one into a blob store such as S3.
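Conceptually, the per-row operation I have in mind looks something like the sketch below (the bucket name, the `persist_row` helper, and the boto3-based upload are hypothetical, purely for illustration):

```python
import boto3

def persist_row(key, value):
    """Hypothetical side effect: store one row's value in S3 under its key."""
    # In practice one would reuse a client across rows rather than
    # constructing one per call; this is just to show the intent.
    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-bucket", Key=key, Body=value)
```

Wrapping `persist_row` in `udf()` only works if I invent a dummy return value and attach it via `withColumn`, which produces exactly the throwaway column I am trying to avoid.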