
I use Kafka for a voting app where a user can choose a candidate and change the selection within a one-hour time range.

Since this is a good fit for a KTable, I use a Kafka Streams app. However, there is a time-range requirement, which means I need to run groupBy().count() only for a specific time range (e.g. 10:00-11:00).

How can I achieve this using the Kafka Streams API?
As far as I know, Kafka (I use Kafka 2.3) stores the publish timestamp in the record metadata, but how do I access it? I'm thinking of using .filter() based on the timestamp.

I have also looked at the windowing documentation, but the time there seems to be relative (e.g. the last hour) rather than fixed (10:00-11:00).

Thank you

Timothy
  • Just for clarification, you want to count votes but for a specific time range only and discard/ignore all records before the start time and those after the end time correct? I'm assuming you want to measure from timestamps and not wallclock time. – bbejeck Apr 29 '20 at 14:46
  • Yes. What do you mean by wallclock? The timestamp in Kafka (publish date) is fine, but how can I extract and use it in Kafka Streams? – Timothy Apr 29 '20 at 16:07
  • By wallclock time I meant the actual time of day not the timestamp of the record. But that doesn't matter in this case because you just want the timestamp of the record itself. I've posted a possible solution below. – bbejeck Apr 29 '20 at 17:34
  • Windows are aligned to the hour. From the docs `For example, tumbling windows with a size of 5000ms have predictable window boundaries [0;5000),[5000;10000),...` – Matthias J. Sax May 02 '20 at 20:38

2 Answers


Timothy,

To access the timestamp of the record, you can use a transformValues() operation. The ValueTransformer you supply has access to the ProcessorContext, and you can call ProcessorContext.timestamp() in the ValueTransformer.transform() method. If the timestamp is within the desired range, return the value; otherwise return null. Then add a filter() after the transformValues() to remove the records you've rejected.

Here's an example I think will work:

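import java.time.Instant;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.kstream.ValueTransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;

class GroupByTimestampExample {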

  public static void main(String[] args) {

    final StreamsBuilder builder = new StreamsBuilder();
    // You need to update the time fields; these are just placeholders
    long earliest = Instant.now().toEpochMilli();
    long latest = Instant.now().toEpochMilli() + (60 * 60 * 1000);

    final ValueTransformerSupplier<String, String> valueTransformerSupplier = new TimeFilteringTransformer(earliest, latest);

    final KTable<String, Long> voteTable = builder.<String, String>stream("topic")
                                            .transformValues(valueTransformerSupplier)
                                            .filter((k, v) -> v != null)
                                            .groupByKey()
                                            .count();

  }

  static final class TimeFilteringTransformer implements ValueTransformerSupplier<String, String> {

    private final long earliest;
    private final long latest;

    public TimeFilteringTransformer(final long earliest, final long latest) {
      this.earliest = earliest;
      this.latest = latest;
    }

    @Override
    public ValueTransformer<String, String> get() {
      return new ValueTransformer<String, String>() {
        private ProcessorContext processorContext;

        @Override
        public void init(ProcessorContext context) {
          processorContext = context;
        }

        @Override
        public String transform(String value) {
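          // timestamp() returns the timestamp of the record currently being processed
          final long ts = processorContext.timestamp();
          if (ts >= earliest && ts <= latest) {
            return value;
          }
          // Out-of-range records become null and are dropped by the downstream filter()
          return null;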
        }

        @Override
        public void close() {

        }
      };
    }
  }
}
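
The `earliest` and `latest` placeholders above could be derived from a fixed clock range rather than from `Instant.now()`. Here is a minimal sketch using only `java.time`, assuming the voting hour is 10:00-11:00 UTC on 2020-04-29. The date, the zone, and the names `VotingWindowBounds` and `toEpochMillis` are illustrative and not taken from the question.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class VotingWindowBounds {

    // Converts a local date and time in the given zone to epoch milliseconds,
    // the same unit that record timestamps use.
    static long toEpochMillis(LocalDate date, LocalTime time, ZoneId zone) {
        return ZonedDateTime.of(date, time, zone).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("UTC");             // assumption: votes are scheduled in UTC
        LocalDate day = LocalDate.of(2020, 4, 29);  // assumption: hypothetical voting day

        long earliest = toEpochMillis(day, LocalTime.of(10, 0), zone);
        long latest = toEpochMillis(day, LocalTime.of(11, 0), zone);

        System.out.println("earliest=" + earliest + ", latest=" + latest);
    }
}
```

The two values can then be passed to the `TimeFilteringTransformer` constructor. Because the transformer compares with `ts <= latest`, a record stamped exactly 11:00:00.000 is still counted; use `<` there if the end of the range should be exclusive.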

Let me know how it goes.

bbejeck
  • Yes, it works fine. I wonder why, though. The [javadoc](https://docs.confluent.io/current/streams/javadocs/org/apache/kafka/streams/processor/ProcessorContext.html#timestamp--) says it returns the current timestamp. Is this the "event time" from your book Kafka Streams in Action? (Great book, by the way. I just read chapter 4 today, but I am still confused about the timestamp and state store parts.) – Timothy May 02 '20 at 03:16
  • Keep reading the JavaDocs: `If it is triggered while processing a record streamed from the source processor, timestamp is defined as the timestamp of the current input record; the timestamp is extracted from ConsumerRecord by TimestampExtractor` – Matthias J. Sax May 02 '20 at 20:40

Actually, tumbling windows are fixed-size, non-overlapping, gap-less windows. In your use case the window size is one hour, and because window boundaries are aligned to the epoch rather than to the first record, a 10:00-11:00 window (start inclusive, end exclusive) will be created, as in your example:

kStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofHours(1)))
    .count();
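
To see why a one-hour tumbling window lands on 10:00-11:00 rather than on "the last hour", the epoch alignment quoted from the docs in the comments (`[0;5000),[5000;10000),...`) can be reproduced with plain arithmetic. This is only an illustration, not Kafka Streams' internal code. The `TumblingWindowAlignment` class, the `windowStart` helper, and the sample vote time are hypothetical, and timestamps are assumed to be non-negative.

```java
import java.time.Duration;
import java.time.Instant;

class TumblingWindowAlignment {

    // Start (inclusive) of the epoch-aligned tumbling window containing the timestamp.
    // Assumes timestampMs >= 0 and sizeMs > 0.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long sizeMs = Duration.ofHours(1).toMillis();

        // Hypothetical vote cast at 10:42:17 UTC
        long ts = Instant.parse("2020-04-29T10:42:17Z").toEpochMilli();

        long start = windowStart(ts, sizeMs);
        long end = start + sizeMs; // exclusive upper bound

        System.out.println(Instant.ofEpochMilli(start) + " .. " + Instant.ofEpochMilli(end));
    }
}
```

Because the alignment is relative to the epoch (UTC), the windows only coincide with local clock hours in time zones whose UTC offset is a whole number of hours.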

Tuyen Luong