I need to increase the input rate per partition for my application and I have used `.set("spark.streaming.kafka.maxRatePerPartition", "100")`
in the config. The batch duration is 10s, so I expect to process 5 * 100 * 10 = 5000
messages for this batch. However, the input rate I am receiving is only about 500. Can you suggest any modifications to increase this rate?
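For context, here is a minimal sketch of the kind of setup described above (suitable for pasting into spark-shell; the master, app name, and batch interval are illustrative, and the rate is passed as a String because `SparkConf.set` takes string values):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch of the configuration described in the question.
// Master and app name are placeholders.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("kafka-rate-limit-example")
  .set("spark.streaming.kafka.maxRatePerPartition", "100")

val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval
```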

2 Answers
> The batch duration is 10s, so I expect to process 5 * 100 * 10 = 5000 messages for this batch.
That's not what the setting means. It means "how many elements each partition can have per batch", not per second. I'm going to assume you have 5 partitions, so you're getting 5 * 100 = 500. If you want 5000, set `maxRatePerPartition` to 1000.
From "Exactly-once Spark Streaming From Apache Kafka" (written by the Cody, the author of the Direct Stream approach, emphasis mine):
> For rate limiting, you can use the Spark configuration variable `spark.streaming.kafka.maxRatePerPartition` to set the maximum number of messages per partition **per batch**.
Edit:
After @avr's comment, I looked inside the code which defines the max rate. As it turns out, the heuristic is a bit more complex than stated in both the blog post and the docs.

There are two branches. If backpressure is enabled alongside `maxRate`, then the effective rate is the minimum between the current backpressure rate calculated by the `RateEstimator` object and the `maxRate` set by the user. If backpressure isn't enabled, it takes the `maxRate` as defined.

Now, after selecting the rate, it always multiplies it by the total batch seconds, effectively making this a rate per second:
```scala
// Scale the per-second, per-partition limit by the batch duration
// to get the per-batch cap for each topic-partition.
if (effectiveRateLimitPerPartition.values.sum > 0) {
  val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
  Some(effectiveRateLimitPerPartition.map {
    case (tp, limit) => tp -> (secsPerBatch * limit).toLong
  })
} else {
  None
}
```
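To make the arithmetic concrete, here is a small standalone sketch (plain Scala, not Spark code; the partition count, rate, and batch duration are assumptions taken from the question's scenario) showing how a per-second rate of 100 turns into the 5000-messages-per-batch figure the question expects:

```scala
// Standalone arithmetic sketch; values are assumptions from the question,
// not read from a real Spark/Kafka deployment.
object RatePerBatch extends App {
  val maxRatePerPartition = 100   // records per partition per second
  val numPartitions       = 5     // assumed partition count
  val batchDurationMs     = 10000 // 10-second batches

  val secsPerBatch = batchDurationMs.toDouble / 1000
  val perPartitionPerBatch = (secsPerBatch * maxRatePerPartition).toLong
  val totalPerBatch        = perPartitionPerBatch * numPartitions

  println(s"Max records per partition per batch: $perPartitionPerBatch") // 1000
  println(s"Max records per batch overall:       $totalPerBatch")        // 5000
}
```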

- [From docs](https://spark.apache.org/docs/latest/configuration.html): _"`spark.streaming.kafka.maxRatePerPartition` Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API."_ This is probably the source of the confusion. – avr Dec 07 '16 at 16:30
- They are wrong :). I'll create a pull request to fix that so others won't get confused. – Yuval Itzchakov Dec 07 '16 at 16:35
- @YuvalItzchakov I'm able to consume the correct number of records as described in the documentation. In my production environment I'm using Kafka 0.9 and Spark 1.6.2 and am able to consume (`number of partitions` * `maxRatePerPartition` * `batch duration`) records. – avr Dec 07 '16 at 18:05
- I set `spark.streaming.kafka.maxRatePerPartition` but `KafkaUtils.createDirectStream` still gets more than the specified value, regardless of whether `spark.streaming.backpressure.enabled` is set to true or false. How do you get `KafkaUtils.createDirectStream` to enforce the specified value of `spark.streaming.kafka.maxRatePerPartition`? Thanks. – Michael May 09 '17 at 04:21
- @Michael Can you give a concrete example? What is your maxRate and batch duration, and what batch sizes are you seeing? – Yuval Itzchakov May 09 '17 at 06:28
- @YuvalItzchakov I found the issue. `spark.streaming.kafka.maxRatePerPartition` is actually implemented as a rate per partition *per second* of the batch stream. So if the maxRatePerPartition is 500 and the batch interval is 10 seconds, then the maximum rate per batch is 5,000. Therefore, I'd suggest renaming this parameter to maxRatePerPartitionPerSecond to make it clearer. – Michael May 09 '17 at 13:30
- @Michael I'm not sure if you read my answer or not, but I'll quote: *Now, after selecting the rate, it always multiplies it by the total batch seconds, effectively making this a rate per second*. – Yuval Itzchakov May 09 '17 at 13:53
- @YuvalItzchakov I did not see your answer. Thanks for referring me to it. – Michael May 10 '17 at 14:50
- Folks, I have configured both params (i.e. maxRatePerPartition and backpressure enabled) and still no limits are respected... any clue? – Mário de Sá Vera Nov 17 '17 at 16:56
The property fetches N messages from each partition per second. If you have M partitions and the batch interval is B seconds, then the total number of messages you can see in a batch is N * M * B.
There are a few things you should verify:
- Is your input rate actually higher than 500 messages per 10s batch (see the sketch below for a quick way to check)?
- Is the Kafka topic properly partitioned?
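To check the first point, one simple approach (assuming `stream` is the DStream returned by `KafkaUtils.createDirectStream`) is to log how many records each batch actually contains:

```scala
// Count the records in every batch; `stream` is assumed to be the direct
// stream created elsewhere with KafkaUtils.createDirectStream.
stream.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}
```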
