
We have a Spark 2.1 Streaming application using mapWithState, with spark.streaming.dynamicAllocation.enabled=true. The pipeline is as follows (paths and combiner arguments elided):

ssc.textFileStream(inputDir)  // inputDir: the monitored input directory (placeholder)
    .map(convertToEvent(_))
    .combineByKey(...., new HashPartitioner(partitions))
    .mapWithState(stateSpec)
    .map(s => sessionAnalysis(s))
    .foreachRDD(rdd => rdd.toDF().....save(output))

The streaming app starts with 2 executors and, as the load increases, allocates new executors as expected. The problem is that the load is not shared by those new executors.

The number of partitions is big enough to spill over to the new executors, and the keys are evenly distributed. I set it up with 40+ partitions, but I can see only 8 partitions (2 executors x 4 cores each) in the mapWithState storage. I expected those 8 partitions to be split up and reassigned when new executors are allocated, but this never happens.
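
For reference, the state's partition count can be checked like this (a sketch; `stateStream` is a hypothetical name for the DStream returned by `mapWithState` above):

    // Log the partition count of the state snapshots on each batch.
    stateStream.stateSnapshots().foreachRDD { rdd =>
      println(s"state partitions: ${rdd.getNumPartitions}")
    }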

Please advise.

Thanks,

  • I am pretty sure stateful transformations cannot work this way and `spark.streaming.dynamicAllocation.enabled` simply won't help you. "State" is the reference here, and since it is partitioned, it will serve as a "template". – zero323 Mar 13 '17 at 19:49
  • @zero323, can you elaborate on "template"? The state is held in a set of RDDs just like any other, so I would have expected it could be rebalanced. Thx – Joe Bledo Mar 14 '17 at 00:20
  • But to "rebalance" you'd have to reshuffle it, so it would require an option to repartition the state. – zero323 Mar 17 '17 at 15:33

1 Answer


Apparently the answer was staring me in the face all along :). As per the documentation excerpt below, RDDs should inherit the number of partitions from their upstream RDDs.

   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
   * spark.default.parallelism is set, then we'll use the value from SparkContext
   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
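
To illustrate the quoted behaviour (a minimal batch sketch, not code from the app):

    // With no explicit partitioner and spark.default.parallelism unset,
    // the join falls back to the max upstream partition count, i.e. 40 here.
    val a = sc.parallelize(1 to 100, 40).map(i => (i, i))
    val b = sc.parallelize(1 to 100, 8).map(i => (i, i))
    println(a.join(b).getNumPartitions)  // should print 40 under these assumptions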

The state inside mapWithState, however, does not have an upstream RDD, so it falls back to the default parallelism unless you set the number of partitions directly on the StateSpec, as in the example below.

val stateSpec = StateSpec.function(crediting.formSession _)
    .timeout(timeout)
    .numPartitions(partitions)  // <-- explicitly sets the partition count of the state

ssc.textFileStream(inputDir)  // inputDir: the monitored input directory (placeholder)
    .map(convertToEvent(_))
    .combineByKey(...., new HashPartitioner(partitions))
    .mapWithState(stateSpec)
    .map(s => sessionAnalysis(s))
    .foreachRDD(rdd => rdd.toDF().....save(output))

Still need to figure out how to make the number of partitions dynamic; with dynamic allocation, this should change at runtime.
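
A sketch of one possible direction (it only sizes the state once at startup rather than adapting at runtime, and `statePartitions` is a hypothetical name):

    // Derive a partition count from the executors alive when the context starts.
    // getExecutorMemoryStatus also lists the driver, hence the - 1.
    val executors = sc.getExecutorMemoryStatus.size - 1
    val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)
    val statePartitions = math.max(1, executors * coresPerExecutor * 2)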

  • "Still need to figure out how to make the number of partitions dynamic; with dynamic allocation, this should change at runtime." This also applies to any other RDD in a streaming app. – Joe Bledo Mar 22 '17 at 13:54