
I have a basic question regarding the flow of control in a Kafka Streams application. Suppose there are two source topics, A and B, and A has records with timestamps that are earlier than those in B. Is there a guarantee about the order in which the records will be processed by the streaming application?

I ran a very rudimentary test: I peeked at the records as they were consumed and printed the instant at which each was processed, via a simple System.out.println of Instant.now().

KStream<String, String> akStream = builder.stream("A",
        Consumed.with(Serdes.String(), Serdes.String()).withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST))
        .peek((key, value) -> System.out.println("Topic A at " + Instant.now()));

KStream<String, String> bkStream = builder.stream("B",
        Consumed.with(Serdes.String(), Serdes.String()))
        .peek((key, value) -> System.out.println("Topic B " + Instant.now()));

These are the end and begin timestamps for the records in the topics (latest first, then earliest):

A : 2020-03-27 14:36:04 (epoch: 1585316164843), 2020-03-27 14:34:02 (epoch: 1585316042569)
B : 2020-03-30 11:04:17 (epoch: 1585559057167), 2020-03-17 14:44:38 (epoch: 1584452678527)

Topic B records get picked up before Topic A. The console output shows all records from topic B. Can someone help me understand this? I would like to use this understanding when writing streaming applications with multiple input sources.

Thanks in advance

Alexander Petrov
Sumit Baurai

2 Answers


The way you have built your streams, each stream exists on its own, so there is no ordering guarantee between them.

As for processing the records based on timestamp: you can do this only within a time window. For example, if you have two topics A and B, you can join them, and within a time window you can order the events.

<VO,VR> KStream<K,VR> join(KStream<K,VO> otherStream,
                           ValueJoiner<? super V,? super VO,? extends VR> joiner,
                           JoinWindows windows)
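
To make the window idea concrete, here is a minimal plain-Java sketch of what a symmetric join window does conceptually: two records are paired only if they share a key and their timestamps differ by at most the window size. It does not use the Kafka Streams API; the Event class, the join helper, and the sample values are all hypothetical and only illustrate the semantics.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowedJoinSketch {

    // Hypothetical record type for illustration: key, value, event timestamp in ms.
    public static final class Event {
        final String key;
        final String value;
        final long timestampMs;

        public Event(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    // Pairs each A event with each B event that has the same key and a
    // timestamp within +/- windowMs (inclusive) of the A event.
    public static List<String> join(List<Event> a, List<Event> b, long windowMs) {
        List<String> joined = new ArrayList<>();
        for (Event left : a) {
            for (Event right : b) {
                boolean sameKey = left.key.equals(right.key);
                boolean inWindow = Math.abs(left.timestampMs - right.timestampMs) <= windowMs;
                if (sameKey && inWindow) {
                    joined.add(left.value + "+" + right.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Event> a = List.of(new Event("k", "a1", 1_000L));
        List<Event> b = List.of(new Event("k", "b1", 1_400L),   // 400 ms apart: joins
                                new Event("k", "b2", 9_000L));  // 8 s apart: outside the window
        System.out.println(join(a, b, 500L));
    }
}
```

With topics that are populated days apart, a window like this would have to be very large to produce any matches, which is the limitation raised in the comment below.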
Alexander Petrov
  • Thanks Alexander! In my use case, topics are populated infrequently, sometimes over a period of days, so I am not looking at windowing because I cannot be sure when a record will come in. The example that I posted was a simple one to understand the ordering. Typically I would be joining these topics and then using the result later in the topology for enrichment via a state store (for example). I want to avoid the situation where the state store is not populated because the topics feeding it have not been processed yet. – Sumit Baurai Mar 30 '20 at 14:02

It depends. In general, there are no guarantees about processing order between different topics. There is one exception though: if a single task processes data from different topics, then records will be processed in timestamp order. However, it's a best-effort approach; as of Kafka Streams 2.1, those ordering guarantees were improved and you can influence them using the max.task.idle.ms configuration.
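
As a rough sketch of how that setting could be supplied, the snippet below builds a plain java.util.Properties object using the string config keys, so it compiles without the Kafka libraries. The application id, broker address, and the 5000 ms value are placeholder assumptions, not recommendations; in a real application you would typically use the StreamsConfig constants (for example StreamsConfig.MAX_TASK_IDLE_MS_CONFIG) and pass the properties to the KafkaStreams constructor.

```java
import java.util.Properties;

public class IdleConfigSketch {

    // Builds a Kafka Streams configuration as plain Properties.
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put("application.id", "ordering-demo");       // hypothetical application id
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker address
        // How long a task may wait for data on an empty input partition
        // before processing what it has buffered from the other partitions.
        props.put("max.task.idle.ms", "5000");              // illustrative value only
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsConfig().getProperty("max.task.idle.ms"));
    }
}
```

A larger value gives the task more time to see records from every input partition before choosing the one with the lowest timestamp, at the cost of added latency.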

Matthias J. Sax
  • Thanks Matthias! Here is what I have tried in the meantime: 1. Consume the topic that I want to use as a state store, with a timestamp extractor that simply returns 0. 2. Consume the topic whose records I want to transform. 3. Within the transformer, access the store from step 1. Finding: even though the state store is initialized, it does not have any values, as peekNextKey throws a NoSuchElementException. Why is the state store only initialized and not populated? – Sumit Baurai Apr 02 '20 at 09:54
  • Do both streams have the same number of partitions and the same keys? -- Records across different partitions should be processed in timestamp order and thus, using a timestamp extractor for one topic that returns 0 should ensure that this topic is processed first. – Matthias J. Sax Apr 02 '20 at 20:52
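
The effect of a timestamp extractor that returns 0 can be illustrated with a small plain-Java simulation. This is only a sketch under strong assumptions: both partitions belong to the same task, all records are already buffered, and the task always picks the buffered head record with the lowest timestamp (ties go to the store topic here, which is an arbitrary choice for the sketch). The class and method names are hypothetical and this is not the Kafka Streams implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TimestampOrderSketch {

    // Returns the order in which records would be picked when the task always
    // takes the head record with the smaller timestamp. If zeroExtractorForStore
    // is true, every store-topic record is treated as having timestamp 0.
    public static List<String> process(long[] storeTs, long[] dataTs, boolean zeroExtractorForStore) {
        Deque<Long> store = new ArrayDeque<>();
        for (long ts : storeTs) {
            store.add(zeroExtractorForStore ? 0L : ts);
        }
        Deque<Long> data = new ArrayDeque<>();
        for (long ts : dataTs) {
            data.add(ts);
        }

        List<String> order = new ArrayList<>();
        while (!store.isEmpty() || !data.isEmpty()) {
            boolean takeStore = data.isEmpty()
                    || (!store.isEmpty() && store.peek() <= data.peek());
            if (takeStore) {
                store.poll();
                order.add("store");
            } else {
                data.poll();
                order.add("data");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        long[] storeTs = {300L, 400L};
        long[] dataTs = {100L, 200L};
        System.out.println(process(storeTs, dataTs, false)); // [data, data, store, store]
        System.out.println(process(storeTs, dataTs, true));  // [store, store, data, data]
    }
}
```

If the two topics are handled by different tasks (for example because their partition counts or keys differ), no such comparison takes place, which is why the question about partitions and keys matters.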