
I want to understand more about Spark's persistence strategy when running on Hadoop.

When I persist a DataFrame with the DISK_ONLY strategy, where is my data stored (path/folder...)? And where do I specify this location?
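
A minimal sketch of what I mean (the input path and app name are just examples):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("disk-only-example").getOrCreate()

// illustrative input; the point is the DISK_ONLY storage level
val df = spark.read.parquet("/some/input/path")
df.persist(StorageLevel.DISK_ONLY)
df.count() // materializes the cache; but where on disk do these blocks end up?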

maffe

2 Answers


For the short answer we can just have a look at the documentation regarding spark.local.dir:

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
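
As an illustration, a minimal sketch of setting it programmatically; the directories are made up, and you could equally pass --conf spark.local.dir=... to spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

// Point the scratch space at one or more fast local disks (comma-separated).
// These paths are purely illustrative. As noted above, on YARN/Mesos this
// setting is overridden by the cluster manager's environment variables.
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", "/mnt/disk1/spark-scratch,/mnt/disk2/spark-scratch")

val sc = new SparkContext(conf)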

For a more in-depth understanding we can look at the code: a DataFrame (which is just a Dataset[Row]) is based on RDDs and leverages the same persistence mechanisms. RDDs delegate persistence to the SparkContext, which marks the RDD for persistence. The task is then actually taken care of by several classes in the org.apache.spark.storage package: first, the BlockManager manages the chunks of data to be persisted and the policy on how to do it, delegating the actual persistence to a DiskStore (when writing to disk, of course), which represents a high-level interface for writing and which in turn has a DiskBlockManager for the more low-level operations.
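
To make the delegation concrete, a small sketch (the SparkSession name spark is assumed): calling persist only marks the RDD, and the actual write through the BlockManager/DiskStore happens when an action forces computation.

import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(StorageLevel.DISK_ONLY) // only marks the RDD for persistence
rdd.count()                         // first action: partitions are computed and written to disk
rdd.count()                         // later actions read the persisted blocks back from disk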

Hopefully you now have an idea of where to look, so we can move on to where the data is actually persisted and how we can even configure it: the DiskBlockManager invokes the helper Utils.getConfiguredLocalDirs, which for practicality I'm going to copy here (taken from the linked 2.2.1 version, the latest release at the time of writing):

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
    val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
    if (isRunningInYarnContainer(conf)) {
        // If we are in yarn mode, systems can have different disk layouts so we must set it
        // to what Yarn on this system said was available. Note this assumes that Yarn has
        // created the directories already, and that they are secured so that only the
        // user has access to them.
        getYarnLocalDirs(conf).split(",")
    } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
        conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
    } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
        conf.getenv("SPARK_LOCAL_DIRS").split(",")
    } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
        // Mesos already creates a directory per Mesos task. Spark should use that directory
        // instead so all temporary files are automatically cleaned up when the Mesos task ends.
        // Note that we don't want this if the shuffle service is enabled because we want to
        // continue to serve shuffle files after the executors that wrote them have already exited.
        Array(conf.getenv("MESOS_DIRECTORY"))
    } else {
        if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
        logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
            "spark.shuffle.service.enabled is enabled.")
        }
        // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
        // configuration to point to a secure directory. So create a subdirectory with restricted
        // permissions under each listed directory.
        conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
    }
}

The code, I believe, is pretty self-explanatory and well commented (and it perfectly matches the contents of the documentation): when running on Yarn there is a specific policy that relies on the storage of Yarn containers, on Mesos it uses the Mesos sandbox (unless the shuffle service is enabled), and in all other cases it will go to the location set under spark.local.dir or, failing that, java.io.tmpdir (which is likely to be /tmp/).

So, if you are just playing around, the data is most likely stored under /tmp/; otherwise it depends a lot on your environment and configuration.
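
If you want to verify this on a local setup, a rough sketch (assuming nothing overrides the defaults) is to look for the blockmgr-* directories that the DiskBlockManager creates under java.io.tmpdir after you persist something and run an action:

import java.io.File

// List the block manager directories Spark created under the default scratch location.
val tmp = new File(System.getProperty("java.io.tmpdir"))
Option(tmp.listFiles()).getOrElse(Array.empty[File])
  .filter(f => f.isDirectory && f.getName.startsWith("blockmgr-"))
  .foreach(d => println(d.getAbsolutePath))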

stefanobaghino
  • Thank you so much @stefanobaghino for the effort you put into this well-structured and detailed answer. The next step for me is now to investigate our YARN configuration loaded by _getYarnLocalDirs(conf).split(",")_. – maffe Jan 24 '18 at 22:19
  • Good answer. I'd argue with _"a DataFrame is based on RDDs"_: it's not, in my opinion, and can just **generate** an RDD lineage that Spark executes (cf. [QueryExecution.toRDD](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L92)). – Jacek Laskowski Jan 25 '18 at 15:46
  • @JacekLaskowski Thanks, if you think it's a good answer, I'm fairly sure it is. :) And thanks for the comment, I actually wasn't aware of this. I'll try to edit the answer in such a way that this is more precisely conveyed. Just so that I don't write anything imprecise, the main point is still valid: the actual caching is delegated to the `RDD`, right? – stefanobaghino Jan 25 '18 at 16:17

To sum it up for my YARN environment:

With the guidance of @stefanobaghino I was able to go one step further in the code, to the point where the YARN config is loaded:

val localDirs = Option(conf.getenv("LOCAL_DIRS")).getOrElse("")

which is populated from the yarn.nodemanager.local-dirs option in yarn-default.xml.
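
For reference, a rough sketch of what that property looks like in the YARN configuration; the values are illustrative (on my nodes the first entry corresponds to /data/1/yarn/local):

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/1/yarn/local,/data/2/yarn/local</value>
</property>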

The background for my question is that, because of the error

2018-01-23 16:57:35,229 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /data/1/yarn/local error, used space above threshold of 98.5%, removing from list of valid directories

my Spark job sometimes got killed, and I wanted to understand whether this disk is also used for the data I persist while the job is running (which is actually a massive amount of data).

So it turns out that this is exactly the folder where the data goes when persisting it with a disk-based strategy such as DISK_ONLY.

Thanks a lot for all your helpful guidance in this problem!

maffe