As far as I know, Spark uses lazy evaluation, meaning that if no action is called, nothing ever actually happens. One way I know to get Spark working is to call the collect() method, but when I read this article it says:
"Usually, collect() is used to retrieve the action output when you have very small result set and calling collect() on an RDD/DataFrame with a bigger result set causes out of memory as it returns the entire dataset (from all workers) to the driver hence we should avoid calling collect() on a larger dataset."
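To illustrate what I mean by lazy evaluation (a minimal sketch; spark, df, and the column expression are just placeholders I made up for this question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # small example DataFrame

doubled = df.selectExpr("id * 2 AS doubled")  # transformation: nothing runs yet
rows = doubled.collect()  # action: pulls the entire result to the driver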
And I actually have a UDF that returns NullType() and writes its output to a directory as a side effect:

from pyspark.sql.functions import udf
from pyspark.sql.types import NullType

@udf(returnType=NullType())
def write_something():
    # write something to dir
    pass

so I do not want to use collect(), because it might cause OOM as mentioned above.
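In case it helps, this is roughly how I apply the UDF (df and the column name "dummy" are placeholders):

result = df.withColumn("dummy", write_something())  # transformation only: the UDF never runs
# result.collect()  # this would trigger it, but risks OOM on a large DataFrame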
So what is the best way to trigger this in my case? Thanks!