
I am new to the Spark community. Please ignore if this question doesn't make sense.

My PySpark DataFrame takes only a fraction of the time (milliseconds) to sort, but moving the data is far more expensive (> 14 seconds).

Explanation: I have a huge collection of Arrow RecordBatches distributed evenly across my worker nodes' memory (in plasma_store). Currently, I collect all those RecordBatches on my master node, merge them, and convert them to a single Spark DataFrame. Then I apply a sorting function to that DataFrame.
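For context, here is a minimal sketch of this collect-and-merge approach. The names `batches` (the list of RecordBatches already fetched to the driver) and `some_column` are placeholders, and `spark` is assumed to be an active SparkSession:

    import pyarrow as pa

    # `batches`: list of pyarrow.RecordBatch objects already pulled to the
    # driver (e.g. fetched from each worker's plasma store) -- placeholder.
    table = pa.Table.from_batches(batches)  # merge into one Arrow Table
    pdf = table.to_pandas()                 # Arrow -> pandas, all on the driver
    df = spark.createDataFrame(pdf)         # pandas -> distributed DataFrame
    df_sorted = df.orderBy("some_column")   # the sort itself is fast; the
                                            # expensive part is moving the data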

A Spark DataFrame is a distributed collection of data across the cluster.

So my question is: is it possible to create a Spark DataFrame from the Arrow RecordBatches that are already distributed across the worker nodes' memory, so that the data stays on the respective worker nodes (instead of being brought to the master node, merged, and then redistributed as a DataFrame)?

Thanks!

1 Answer


Yes, you can store the data in the Spark cache; whenever you access the data afterwards, it will be read from the cache rather than recomputed from the source.
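For illustration, a minimal sketch of caching, assuming `df` is an existing DataFrame and `some_column` is a placeholder column name:

    df.cache()    # mark the DataFrame for caching (lazy; nothing happens yet)
    df.count()    # first action materializes the cache on the executors
    df.orderBy("some_column").show()  # later jobs read from the cached data,
                                      # not from the original source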

Please use the links below to learn more about caching:

  • https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
  • Where is df.cache() stored?
  • https://unraveldata.com/to-cache-or-not-to-cache/

Jim Macaulay