What is the advantage of having Replication partition by setting the STORAGE LEVELS like MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc?
If we already have the HDFS replication, what is the use of having this one?
When you persist a Spark RDD/Dataset using MEMORY_ONLY_2 or MEMORY_AND_DISK_2, the data does not go to HDFS. It is stored in the local storage (memory and/or local disk) of the nodes where the tasks run.
Replication is handled by Spark, not by HDFS. If Spark fails to retrieve a persisted partition, it has to recompute that partition from its lineage. A replication factor of 2 ensures that each persisted partition is kept on two nodes, so losing one node does not force a recomputation.
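To make this concrete, here is a minimal sketch of requesting a replicated storage level. The app name, input path, and Dataset are placeholders, not from the question; any RDD/Dataset works the same way:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReplicatedPersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("replicated-persist").getOrCreate()

    // Hypothetical input path.
    val events = spark.read.parquet("/data/events")

    // The _2 storage levels tell Spark's block manager to keep each cached
    // partition on two executors (replication handled by Spark, not HDFS).
    val persisted = events.persist(StorageLevel.MEMORY_AND_DISK_2)

    // The first action materializes and replicates the partitions; after a
    // node failure, Spark reads the surviving replica instead of recomputing.
    persisted.count()

    spark.stop()
  }
}
```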
You can also see details of the persisted partitions in the Spark UI. Under the Storage tab you can see all the persisted data: the nodes on which it is stored, the size of the partitions in memory (on-heap/off-heap) and on disk, and so on.
Spark RDDs/Datasets are lazily evaluated.
If two separate actions depend on the same RDD/DS, the RDD/DS will be evaluated twice, which may be expensive.
To avoid this, we can cache/persist the RDD/DS so that the second and subsequent uses load it from the cache.
.cache will store the RDD/DS once it has been evaluated by an action, using the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for Datasets/DataFrames). Alternatively, .persist may be used, which allows full control over the storage level.
As a general rule of thumb, if you use an expensive-to-compute RDD/DS more than once, consider caching it.
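As an illustration (the paths and column name are hypothetical), the following sketch caches a Dataset that two actions reuse, so the expensive aggregation runs only once:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-example").getOrCreate()

    // Hypothetical expensive pipeline: a wide aggregation over a large input.
    val summary = spark.read.parquet("/data/raw")
      .groupBy("key")
      .count()

    // Without this line, both actions below would re-run the aggregation.
    summary.cache() // for Datasets, same as summary.persist(StorageLevel.MEMORY_AND_DISK)

    summary.count()                        // first action: evaluates and fills the cache
    summary.write.parquet("/data/summary") // second action: served from the cache

    summary.unpersist()                    // free the cached blocks when finished
    spark.stop()
  }
}
```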