
I am trying out the persist feature in Spark, to keep data in memory and run computations on it. I am under the assumption that storing the data in memory would make the computations faster for iterative algorithms such as K-means clustering in MLlib.

    val data3 = sc.textFile("hdfs:.../inputData.txt")
    val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
    parsedData3.persist(MEMORY_ONLY)

The call to persist throws the following error:

    scala> parsedData3.persist(MEMORY_ONLY)
    <console>:17: error: not found: value MEMORY_ONLY
                  parsedData3.persist(MEMORY_ONLY)

Could someone help me with how to correctly use persist to keep data in memory for use in an iterative algorithm?

Ravi
    `error: not found: value MEMORY_ONLY` - did you actually read this?! ;) – samthebest Jul 17 '14 at 10:10
  • I realize this is not a Java question, but for the Java folks reading this: don't forget to put the parentheses on the end: StorageLevel.MEMORY_ONLY_SER() and use import org.apache.spark.storage.StorageLevel; – JimLohse Jan 08 '16 at 15:26

1 Answer


If you look at the signature of `rdd.persist`, `def persist(newLevel: StorageLevel): this.type`, you can see that it takes a value of type `StorageLevel`, so the correct way to call `persist` in your example would be:

    import org.apache.spark.storage.StorageLevel
    parsedData3.persist(StorageLevel.MEMORY_ONLY)

The companion object of `StorageLevel` defines these constants, so importing its members brings them into scope and lets you use the constant directly (as in your code):

    import org.apache.spark.storage.StorageLevel._
    ...
    parsedData3.persist(MEMORY_ONLY)  // this also works
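To illustrate the iterative use case from the question, here is a minimal, self-contained sketch (object and app names are hypothetical, and `parallelize` over two small strings stands in for the HDFS `textFile` input so it runs locally):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Sketch: persist the parsed RDD once, then reuse it across iterations
// so each pass reads the cached partitions instead of re-parsing the input.
object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "persist-sketch")

    // Stand-in for sc.textFile(...): two tab-separated rows of doubles.
    val parsedData = sc.parallelize(Seq("1\t2", "3\t4"))
      .map(_.split('\t').map(_.toDouble))

    parsedData.persist(StorageLevel.MEMORY_ONLY)

    // Each iteration hits the in-memory copy after the first pass.
    for (i <- 1 to 3) {
      val total = parsedData.map(_.sum).reduce(_ + _)
      println(s"iteration $i: total = $total")
    }

    parsedData.unpersist()
    sc.stop()
  }
}
```

Note that `persist` is lazy: the data is only materialized in memory on the first action (here, the first `reduce`); subsequent iterations then read from the cache.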
maasg