
As far as I know, Spark uses lazy evaluation, meaning that if no action is ever called, nothing actually happens. One way I know to make Spark do the work is to call collect(), but when I read the article it says:

Usually, collect() is used to retrieve the action output when you have a very small result set. Calling collect() on an RDD/DataFrame with a bigger result set causes out of memory, as it returns the entire dataset (from all workers) to the driver; hence we should avoid calling collect() on a larger dataset.

And I actually have a UDF that returns NullType():

from pyspark.sql.functions import udf

@udf
def write_something():
    pass  # write something to a directory (side effect only)
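
Right now, the only way I know to force this UDF to actually run is something like the following (the withColumn call and the column name are just a sketch of how I apply it to my DataFrame df):

result = df.withColumn("written", write_something())
result.collect()  # triggers the job, but pulls every row back to the driver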

So I do not want to use collect(), because it might cause an OOM as mentioned above.

So in my case, what is the best way to trigger the computation? Thanks!

Pro_gram_mer

1 Answer


You can use DataFrame.foreach:

df.foreach(lambda x: None)

The foreach action will trigger the execution of the whole DAG of df while keeping all data on their respective executors.
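
For example, keeping the UDF from the question, the pattern could look roughly like this (the column name "written" is only a placeholder):

result = df.withColumn("written", write_something())
result.foreach(lambda x: None)  # forces evaluation on the executors; nothing is returned to the driver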

The pattern foreach(lambda x: None) is mainly useful for debugging purposes. A cleaner option might be to remove the udf and put its logic into the function that is called by foreach.
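
A minimal sketch of that variant, assuming the side effect is simply writing each row to a local file on the executor (the path and the write logic are placeholders for whatever your UDF actually does):

def write_row(row):
    # placeholder side effect: append the row to a file on the executor's local disk
    with open("/tmp/output.txt", "a") as f:
        f.write(str(row.asDict()) + "\n")

df.foreach(write_row)  # runs write_row on the executors for every row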

werner