
I have an RDD with 5 partitions and 5 workers/executors.

How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?

Am I right in saying that Spark can save several partitions on one worker and zero partitions on other workers?

In other words, I can specify the number of partitions, but Spark can still cache everything on a single node.

Replication is not an option, since the RDD is huge.

Workarounds I have found

getPreferredLocations

RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterwards it will cache the partition on a different node.

As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news is that you cannot do that from Java; you need to write Scala code, or at least Scala internals wrapped with Java code. For example:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {

  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  // Data and partitioning come from the parent RDD; only the placement hint changes
  override protected def getPartitions: Array[Partition] = firstParent[U].partitions
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[U].iterator(split, context)

  // Pin partition i to one of the listed node IPs, round-robin
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
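
To make Spark actually honour those hints, my understanding is that spark.locality.wait has to be raised and the source RDD wrapped. A rough, untested sketch (the 30-minute value and the input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: a very high locality wait so Spark keeps waiting for the preferred node
// instead of falling back to ANY locality. The 30m value is arbitrary.
val conf = new SparkConf()
  .setAppName("node-affinity")
  .set("spark.locality.wait", "30m")
val sc = new SparkContext(conf)

val source = sc.textFile("hdfs:///sales/")        // placeholder input
val pinned = new NodeAffinityRDD(source).cache()  // partitions hinted to nodeIPs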

SparkContext's makeRDD

SparkContext has a makeRDD method, but it lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news is that the preferred locations are discarded on the first shuffle/join/cogroup operation.
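
Something like this is what I have in mind (a sketch; the elements and host addresses are placeholders, and each element of the sequence becomes its own partition with a location hint):

// Sketch: makeRDD with one location preference per element.
// Each element becomes one partition, hinted (not forced) to the listed host.
val data: Seq[(String, Seq[String])] = Seq(
  ("sales-001.parquet", Seq("192.168.2.140")),
  ("sales-002.parquet", Seq("192.168.2.157")),
  ("sales-003.parquet", Seq("192.168.2.77"))
)
val rdd = sc.makeRDD(data)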


Both approaches share the same drawback: too high a spark.locality.wait can cause your cluster to starve if some of the nodes become unavailable.

P.S. More context

I have up to 10,000 sales-XXX.parquet files, each representing the sales of different goods in different regions. Each sales-XXX.parquet can vary from a few KB to a few GB. Together, all the sales-XXX.parquet files could take up tens or hundreds of GB on HDFS. I need a full-text search through all sales, so I have to index each sales-XXX.parquet one by one with Lucene. Now I have two options:

  1. Keep the Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
  2. Keep the Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires each worker node to keep an equal amount of data. How can I ensure that Spark keeps an equal amount of data on each worker node? (A rough sketch of what I mean follows below.)
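
To make option 2 concrete, this is the rough shape I have in mind (a sketch; searchLocalIndex is a hypothetical placeholder for a per-node Lucene lookup, and numIndexes is just an example value):

// Hypothetical stand-in for a per-node Lucene lookup (not a real API):
def searchLocalIndex(indexDir: String, query: String): Seq[String] = ???

val query = "some full-text query"
val numIndexes = 5  // placeholder: number of local indexes / worker nodes
val hits = sc
  .parallelize(0 until numIndexes, numIndexes)                      // one task per local index
  .flatMap(i => searchLocalIndex(s"/local/lucene/index-$i", query)) // worker-local lookup
  .collect()                                                        // combine on the driver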
  • I hope you know that a random `@` has no use at all :) It doesn't notify anyone who isn't already active in a given thread. `spark.deploy.spreadOut` typically worked for me, but as far as I know it is not portable and doesn't provide any guarantees. What is the end goal here? You cache the data, assume no worker faults, but what comes next? – zero323 Mar 03 '17 at 01:08
  • Have you actually observed spark putting all your data on a single node, or is this just theoretical? – puhlen Mar 03 '17 at 02:08
  • @zero323 sorry for that. But that `@` wasn't so random, I just chose the top 5 Spark users on Stack Overflow :) Please look at the P.S. section of the question. Do I really need `spark.deploy.spreadOut`? – VB_ Mar 03 '17 at 09:21
  • @puhlen That's just theoretical, and I haven't done any tests of this kind. Should I worry about this at all? Please look at `P.S. More Context` section of my question – VB_ Mar 03 '17 at 09:22
  • Don't worry :) My only point is that @ doesn't work. Reading "more context" - at this scale I wouldn't worry much about skews and wouldn't even try micromanaging. What troubles me more is 10,000 RDDs (I haven't tried it, so it's just a hunch, but it may drive some components, including LRU tracking, crazy) and full-text search. At first glance I would rather look at some in-memory data grid, and maybe Succinct. – zero323 Mar 03 '17 at 12:58

0 Answers