
I have a Spring Boot-based KStreams application where I am joining data across multiple topics. What are the best practices for handling a situation where one topic's data is delayed? I've read links such as How to manage Kafka KStream to Kstream windowed join? and others.

Here is my sample code (Spring Boot app) to produce mock data to two topics, Employee and Finance. Code for the employee topic below:

private void sendEmpData() {
    IntStream.range(0, 1).forEach(index -> {
        EmployeeKey key = new EmployeeKey();
        key.setEmployeeId(1);

        Employee employee = new Employee();
        employee.setDepartmentId(1000);
        employee.setEmployeeFirstName("John");
        employee.setEmployeeId(1);
        employee.setEmployeeLastName("Doe");

        kafkaTemplateForEmp.send(EMP_TOPIC, key, employee);
    });
}

Likewise for the finance topic:

private void sendFinanceData() {
    IntStream.range(0, 1).forEach(index -> {
        FinanceKey key = new FinanceKey();
        key.setEmployeeId(1);
        key.setDepartmentId(1000);

        Finance finance = new Finance();
        finance.setDepartmentId(1000);
        finance.setEmployeeId(1);
        finance.setSalary(2000);

        kafkaTemplateForFinance.send(FINANCE_TOPIC, key, finance);
    });
}

The timestamp type associated with these records is TimestampType.CREATE_TIME, which I am assuming is the same as event time in Streams.

I have a simple KStreams app that rekeys the finance topic so the finance stream key matches the employee stream key (a sketch of the rekeying is shown below) and then performs the join:
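
Roughly, the rekeying looks like this (a simplified sketch; the new key is built from the employee id carried in the finance record):

// Sketch only: rekey the finance stream so its key matches the employee stream's key.
KStream<EmployeeKey, Finance> reKeyedStream = financeKStream.selectKey((financeKey, finance) -> {
    EmployeeKey employeeKey = new EmployeeKey();
    employeeKey.setEmployeeId(finance.getEmployeeId());
    return employeeKey;
});

The join itself: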

employeeKStream.join(reKeyedStream,
        (employee, finance) -> new EmployeeFinance(employee.getEmployeeId(),
                employee.getEmployeeFirstName(),
                employee.getEmployeeLastName(),
                employee.getDepartmentId(),
                finance.getSalary(),
                finance.getSalaryGrade()),
        JoinWindows.of(windowRetentionTimeMs), // 30 seconds
        Joined.with(employeeKeySerde, employeeSerde, financeSerde))
    .to(outputTopic, Produced.with(employeeKeySerde, employeeFinanceSerde));

If a record with a matching key arrives more than 30 seconds later in the finance topic, then the join doesn't happen. Any insights on how to address this would be helpful. Thank you in advance.

P.S.: This data is a work of fiction. If it matches your department ID/salary, it's merely coincidental. :)

user123

1 Answer

If a record with a matching key arrives more than 30 seconds later in the finance topic, then the join doesn't happen.

By default, Kafka Streams uses a grace period of 24 hours; hence, even if there is lag or out-of-order data, your join should work. Note that lag in Kafka always refers to the read path!
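
For example (a sketch, assuming Kafka Streams 2.1 or newer, where JoinWindows accepts a Duration and exposes grace()), you could spell out both the window size and the grace period:

// Sketch: the same 30-second join window as in the question, with the grace
// period (how long out-of-order records are still accepted after the window
// closes) written out explicitly instead of relying on the default.
JoinWindows thirtySecondWindow = JoinWindows.of(Duration.ofSeconds(30))
        .grace(Duration.ofHours(24));

The rest of the join call from the question stays the same; only the JoinWindows argument changes.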

arrives more than 30 seconds later in the finance topic

However, I think you don't really mean that you have lag (in the sense that you fall behind on reading), but that your upstream write is delayed -- in this case, the event time may just be assigned incorrectly:

Note that when writing to a Kafka topic without specifying the timestamp explicitly, the producer assigns the timestamp when send() is called -- not when the ProducerRecord instance is created. If you want to assign a timestamp at the time the ProducerRecord is created, you need to pass that timestamp into the constructor manually (not sure if Spring Boot allows you to do this).
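
For illustration, a sketch using the question's employee producer (spring-kafka's KafkaTemplate can send a pre-built ProducerRecord, and the five-argument ProducerRecord constructor takes an explicit timestamp):

long eventTimeMs = System.currentTimeMillis(); // capture the event time when the record is created

// ProducerRecord(topic, partition, timestamp, key, value): passing the timestamp
// here pins the record's CREATE_TIME to the event time, no matter when send()
// is actually called. The partition is left null so the default partitioner is used.
ProducerRecord<EmployeeKey, Employee> record =
        new ProducerRecord<>(EMP_TOPIC, null, eventTimeMs, key, employee);

kafkaTemplateForEmp.send(record);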

As an alternative (in case you cannot set the record timestamp explicitly), you could embed the timestamp in the value (this of course requires changing your Employee and Finance classes). When processing this data with Kafka Streams, you can then use a custom TimestampExtractor to get your custom timestamp instead of the record timestamp.
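
A sketch of such an extractor, assuming you add a (hypothetical) getEventTime() accessor returning epoch milliseconds to the value classes:

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof Employee) {
            return ((Employee) value).getEventTime(); // hypothetical embedded timestamp
        }
        if (value instanceof Finance) {
            return ((Finance) value).getEventTime(); // hypothetical embedded timestamp
        }
        return partitionTime; // fall back if the value carries no embedded timestamp
    }
}

You can then register it per input stream, for example via Consumed.with(employeeKeySerde, employeeSerde).withTimestampExtractor(new EventTimeExtractor()).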

Matthias J. Sax