
As far as I know, Spark uses lazy evaluation, meaning that if no action is ever called, nothing actually happens. One way I know to make Spark do the work is to call collect(), but when I read the article it says:

Usually, collect() is used to retrieve the action output when you have a very small result set. Calling collect() on an RDD/DataFrame with a bigger result set causes out of memory, as it returns the entire dataset (from all workers) to the driver; hence we should avoid calling collect() on a larger dataset.

And I actually have a UDF that returns NullType():

from pyspark.sql.functions import udf

@udf
def write_something():
    pass  # write something to a directory (side effect only)
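
Right now, the only way I know to force this UDF to actually run is something like the following (the withColumn call and the column name are just a sketch of how I apply it to my DataFrame df):

result = df.withColumn("written", write_something())
result.collect()  # triggers the job, but pulls every row back to the driver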

So I do not want to use collect(), because it might cause an OOM as mentioned above.

So in my case, what is the best way to trigger the computation? Thanks!

Pro_gram_mer

1 Answer


You can use DataFrame.foreach:

df.foreach(lambda x: None)

The foreach action will trigger the execution of the whole DAG of df while keeping all data on their respective executors.
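
For example, keeping the UDF from the question, the pattern could look roughly like this (the column name "written" is only a placeholder):

result = df.withColumn("written", write_something())
result.foreach(lambda x: None)  # forces evaluation on the executors; nothing is returned to the driver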

The pattern foreach(lambda x: None) is mainly useful for debugging purposes. A cleaner option might be to remove the udf and put its logic into the function that is called by foreach.
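
A minimal sketch of that variant, assuming the side effect is simply writing each row to a local file on the executor (the path and the write logic are placeholders for whatever your UDF actually does):

def write_row(row):
    # placeholder side effect: append the row to a file on the executor's local disk
    with open("/tmp/output.txt", "a") as f:
        f.write(str(row.asDict()) + "\n")

df.foreach(write_row)  # runs write_row on the executors for every row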

werner