
Is it possible to notify the consumer once the producer has published all the data to a Kafka topic?

The same data (with some unique field) may be available in multiple partitions, so I need to group the data and do some calculations.

I thought of using a sliding window for this, but the problem remains that we don't know whether the producer has finished publishing the data.

The number of messages is around 50K. Can Kafka handle 50K messages [single partition] within seconds if we have brokers with a good configuration?

Currently, we are planning to have multiple partitions and to split the data using the default partitioner.

Is there an efficient way to handle this?

Update:

Every fifteen minutes, the producer gets the data and starts publishing it to the Kafka topic. I am sure this is a use case for batch processing, but this is our current design.

Shankar
  • Not sure what you mean by "done" when you are talking about streams. Isn't the whole point of streams that they are of indefinite length? If your producer is generating batches of messages and you care about batch boundaries, perhaps you could post an "end of batch" message. – Joe Pallas Nov 13 '16 at 02:05
  • @JoePallas: I don't get the point; what do you mean by posting an "end of batch" message? – Shankar Nov 13 '16 at 03:45
  • The producer knows when it has finished processing a batch, but the consumer does not know if it has seen all the messages in the batch. If the producer publishes a special "end of batch" message after all the data for the batch has been published, the consumer can wait until it sees that before processing the batch. That may get more complicated if you have multiple partitions; markers would have to go to every partition. – Joe Pallas Nov 13 '16 at 19:04
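
Here is a minimal sketch of that suggestion on the producer side, using the Kafka producer API from Scala. The topic name, broker address, and the "END_OF_BATCH" payload are all hypothetical placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object BatchMarkerProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val topic = "events" // hypothetical topic name

    // ... publish the ~50K data records for this batch here ...

    // After the batch is fully published, send a marker to every
    // partition so each consumer sees an explicit "end of batch".
    val partitions = producer.partitionsFor(topic).size()
    (0 until partitions).foreach { p =>
      producer.send(new ProducerRecord[String, String](topic, p, null, "END_OF_BATCH"))
    }
    producer.flush() // make sure the markers actually reach the brokers
    producer.close()
  }
}
```

Sending the marker explicitly to every partition matters: with the default partitioner, a single marker message would land in only one partition, and consumers reading the other partitions would never see it.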

1 Answer


Spark Streaming doesn't work like that. It processes an infinite stream of incoming data at each batch interval. This means that if you want to signal a logical "end of batch", you'll need to send a message indicating that this batch of data is over, allowing you to send the processed messages to an output sink of your choice.
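
As a rough illustration of detecting such a marker with the Kafka direct stream (spark-streaming-kafka-0-10), here is a sketch. The topic, group id, and "END_OF_BATCH" marker are hypothetical, and it assumes the whole logical batch, markers included, arrives within a single micro-batch interval; otherwise the marker count would have to be tracked in state across intervals:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object BatchBoundaryConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchBoundaryConsumer").setMaster("local[*]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092", // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "batch-boundary-demo",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      val values  = rdd.map(_.value()).cache()
      val markers = values.filter(_ == "END_OF_BATCH").count()
      val data    = values.filter(_ != "END_OF_BATCH")

      // With the direct stream, RDD partitions map 1:1 to Kafka partitions,
      // so the logical batch is complete once one marker per partition arrives.
      if (markers >= rdd.getNumPartitions) {
        // write `data` (or the aggregated state) to the output sink here
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```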

One way you can achieve this is by using stateful streams, which aggregate data across batches and allow you to keep state between batch intervals.
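
For instance, here is a minimal `mapWithState` sketch that keeps a running sum per unique key across batch intervals; the socket source, the "uniqueId,value" record layout, and the sum are placeholders for the real Kafka stream and calculation:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulAggregation").setMaster("local[*]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/state-checkpoint") // mapWithState requires checkpointing

    // Placeholder input: lines of "uniqueId,value"; in the real job this
    // would be the Kafka stream instead of a socket.
    val keyed = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toDouble))

    // Keep a running sum per unique id across batch intervals.
    val spec = StateSpec.function(
      (key: String, value: Option[Double], state: State[Double]) => {
        val sum = state.getOption().getOrElse(0.0) + value.getOrElse(0.0)
        state.update(sum)
        (key, sum)
      })

    keyed.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that `mapWithState` needs a checkpoint directory, and the accumulated state is kept across batch intervals until it is explicitly removed or times out.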

Yuval Itzchakov
  • Thanks, can you give more info or a link on stateful streams so I can try something? – Shankar Nov 16 '16 at 09:03
  • @Shankar You can read [this blog post](http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark) (Disclaimer: I am the author). – Yuval Itzchakov Nov 16 '16 at 09:07