I have a Spark 2.1.1 job running in a Mesos cluster. The Spark UI shows 32 active executors, and RDD.getNumPartitions reports 28 partitions, but only one (random) executor is doing any work; all the others are marked as completed. I added debug statements (stdout) to the executor code and only one executor prints them. The entire pipeline is structured as follows: get a list of ids -> download JSON data for each id -> parse the JSON data -> save to S3.
stage 1: val ids=session.sparkContext.textFile(path).repartition(28) -> RDD[String]
//ids.getNumPartitions shows 28
stage 2: val json=ids.mapPartitions { keys =>
  val urlBuilder ...
  val buffer ...
  keys map { id =>
    val url=urlBuilder.createUrl(id) //java.net.URL
    val stream=url.openStream() ... //download text into buffer, close stream
    (id, buffer.toString)
  }
} -> RDD[Tuple2[String,String]]
stage 3: val output = json flatMap { t =>
  val values = ... //parse JSON, get values from JSON or empty sequence if not found
  values map { value => (t._1, value) }
} -> RDD[Tuple2[String,String]]
stage 4: output.saveAsTextFile("s3://...")
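For context, a concrete version of the stage 2 body could look roughly like this (a minimal sketch; the endpoint is a hypothetical stand-in for the elided urlBuilder logic, and scala.io.Source is used just to read the stream):

import scala.io.{Codec, Source}
import org.apache.spark.rdd.RDD

// Sketch of stage 2: download the JSON text for each id inside mapPartitions,
// making sure every stream is closed even if the read fails.
def downloadJson(ids: RDD[String]): RDD[(String, String)] =
  ids.mapPartitions { keys =>
    keys.map { id =>
      // hypothetical endpoint; the real job builds this with urlBuilder.createUrl(id)
      val url = new java.net.URL(s"https://example.com/items/$id.json")
      val in  = url.openStream()
      try {
        (id, Source.fromInputStream(in)(Codec.UTF8).mkString)
      } finally {
        in.close()
      }
    }
  }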
These are the config settings for the Spark binary: --driver-memory 32g --conf spark.driver.cores=4 --executor-memory 4g --conf spark.cores.max=128 --conf spark.executor.cores=4
The stage that runs on only one executor is the second one. I explicitly specified the number of partitions (repartition(28)) in stage 1. Has anyone seen such behavior before? Thanks,
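For reference, one way to check whether the repartition(28) actually spread the ids across partitions (a minimal sketch using the standard mapPartitionsWithIndex; if most counts come back as 0, the imbalance is already there before stage 2 runs):

// Count how many ids landed in each partition after the repartition.
val counts = ids
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
  .sortBy(_._1)
counts.foreach { case (idx, n) => println(s"partition $idx -> $n ids") }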
M
SOLUTION
I went the other way (see the suggestion from Travis) and increased the number of partitions (after stage 1) to 100. That worked; the job finished in a matter of minutes. But there is a side effect: I now have 100 partial files sitting in S3.
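If the number of output files is a problem, the fetch parallelism and the write parallelism don't have to match; a shuffled coalesce just before the save keeps the 100-way download but writes fewer S3 objects (a sketch against the output RDD above; the target of 8 partitions is an arbitrary choice):

// shuffle = true keeps the upstream download running with 100 tasks;
// only the final write is reduced to 8 part files.
output
  .coalesce(8, shuffle = true)
  .saveAsTextFile("s3://...")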