
I'm experimenting with Spark's Continuous Processing mode in Structured Streaming and I'm reading from a Kafka topic with 2 partitions while the Spark application has only one executor with one core.

The application is simple: it reads from the first topic and publishes to a second one. The problem is that the console-consumer reading from the second topic sees only messages from one partition of the first topic, which means my Spark application reads messages from only one partition of that topic.
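For reference, the application is roughly the following (a minimal sketch; broker address, topic names, and the checkpoint path are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaPassthrough {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("continuous-kafka-passthrough")
      .getOrCreate()

    // Read from the first topic (2 partitions)
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()

    // Publish the records unchanged to the second topic,
    // using the continuous processing trigger
    val query = input
      .select("key", "value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/checkpoint")
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}
```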

How can I make my Spark application read from both partitions of the topic?

Note

I'm asking this question for people who might run into the same issue as I did.


1 Answer


I found the answer to my question in the caveats section of the Spark Structured Streaming documentation.

Basically, in continuous processing mode Spark launches long-running tasks, each of which reads from one partition of the topic. Since only one task can run per core, the Spark application needs at least as many cores as the number of Kafka topic partitions it reads from.
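A minimal sketch of sizing the application accordingly (property names assume YARN; for standalone mode `spark.cores.max` plays a similar role, and the values here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: give the application at least as many total cores as the input
// topic has partitions (here 2), i.e. #cores per executor * #executors >= 2.
val spark = SparkSession.builder()
  .appName("continuous-kafka-passthrough")
  .config("spark.executor.cores", "2")      // cores per executor
  .config("spark.executor.instances", "1")  // number of executors (YARN)
  .getOrCreate()
```

In practice these settings are usually passed to `spark-submit` (e.g. `--executor-cores`) rather than set in code, but the point is the same: the application's total core count must be at least the partition count.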

  • Cores or Executors? One executor can occupy one core. Your **cluster** needs to have that many cores available, and your application can be configured to take up to `cores * executors` – OneCricketeer Jan 10 '19 at 17:32
  • Thanks for asking! I wasn't very explicit, I mean total number of cores for the application which means `#cores/executor * #executors` assigned to the application – M-Doru Jan 11 '19 at 10:21
  • Ok, good to know, thx. Just interested, did you measure latency with continuous processing? – kensai Sep 16 '19 at 18:29