
Scenario:

I am writing JSON object data into a Kafka topic, and while reading I want to read only a specific set of messages based on a value present in the message. I am using the kafka-python library.

sample messages:

{flow_status: "completed", value: 1, active: yes}
{flow_status:"failure",value 2, active:yes}

Here I want to read only the messages having flow_status as "completed".

Prabhanj
  • I have created a Kafka consumer that can do that with the help of Spring Kafka. You may get some ideas when you read this blog: https://rcvaram.medium.com/kafka-customer-get-what-needs-only-45d95f9b1105 – Sivaram Rasathurai Jan 30 '22 at 03:27

4 Answers


In Kafka it's not possible to do something like that. The consumer consumes messages one by one, one after the other, starting from the last committed offset (or from the beginning, or by seeking to a specific offset). Depending on your use case, you could have a different flow in your scenario: the message describing the work to do goes into one topic, and the application that processes the action then writes the result (completed or failed) into two different topics; this way you have all completed messages separated from the failed ones. Another way is to use a Kafka Streams application for doing the filtering, but take into account that it's just sugar: in reality the Streams application always reads all the messages, it just lets you filter them easily.
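
A minimal sketch of the two-topic idea with kafka-python; the topic names, broker address and result dicts here are assumptions for illustration, not something from the question:

import json
from kafka import KafkaProducer

# Assumed topic names, for illustration only
COMPLETED_TOPIC = "flows-completed"
FAILED_TOPIC = "flows-failed"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_result(result):
    # Route each processed message to a topic based on its status
    topic = COMPLETED_TOPIC if result["flow_status"] == "completed" else FAILED_TOPIC
    producer.send(topic, result)

publish_result({"flow_status": "completed", "value": 1, "active": "yes"})
publish_result({"flow_status": "failure", "value": 2, "active": "yes"})
producer.flush()

A consumer interested only in completed flows then subscribes to flows-completed and never sees the failed ones.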

ppatierno
  • So I can have 3 topics: 1 for the whole log, 1 for completed status, 1 for failure status... the job will write to topic 1, then filter data based on status into the other topics. – Prabhanj Feb 18 '19 at 09:00
  • Exactly; for you the status is effectively the kind of the message, and in this use case each kind deserves its own topic (one for completed and one for failure). – ppatierno Feb 18 '19 at 09:03
  • Is it a good approach to have a single topic with two partitions (one for completed, one for failure)? While sending, the producer would keep the logic to send data to the respective partition... at the consumer end I would create separate consumer groups, one group to read from the failed partition and the other to read from the completed partition. – Prabhanj Feb 19 '19 at 09:44
  • 1
    the producer side could be good yes but you need to implement a custom partitioner for doing that. On the consumer side is quite the opposite, two consumers needs to be in the same consumer group in order to have one partition assigned each one. If they are part of different consumer groups they will get all messages from both partitions. In any case it doesn't work well, because if one consumer crashes, the other will get the other partition (receiving completed and failed messages). You could avoid using consumer groups but direct partitions assignment. – ppatierno Feb 19 '19 at 09:52
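
A sketch of the direct partition assignment mentioned in the last comment, using kafka-python; the topic name and the partition layout (partition 0 = completed, partition 1 = failed) are assumptions for illustration:

import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Direct assignment bypasses consumer-group rebalancing, so this consumer
# is never handed the "failed" partition when another consumer crashes
consumer.assign([TopicPartition("flows", 0)])

for message in consumer:
    print(message.partition, message.value)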

You can create two different topics: one for completed and another for failure status. Then read messages from the completed topic to handle them.

Otherwise, if you want them to be in a single topic and want to read only completed ones, I believe you need to read them all and ignore the failure ones using a simple if-else condition.
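
A minimal sketch of that filter-on-consume approach with kafka-python (the topic name and broker address are assumptions for illustration):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "flows",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    record = message.value
    # Every message is still read from the topic; non-completed ones are simply skipped
    if record.get("flow_status") == "completed":
        print(record)

Note that this does not save any network or broker work: the consumer still fetches every message and discards the ones it doesn't care about.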

Monzurul Shimul

The Kafka consumer doesn't support this kind of functionality out of the box. You would have to consume all events sequentially, filter out the completed-status events and put them somewhere. Instead, you can consider using a Kafka Streams application, where you read the data as a stream, filter the events where flow_status = "completed", and publish them to some output topic or other destination.

Example:

KStream<String, JsonNode> inputStream = builder.stream(inputTopic);
// filter() receives both key and value; read the JSON field as text before comparing
KStream<String, JsonNode> completedFlowStream = inputStream.filter((key, value) -> "completed".equals(value.get("flow_status").asText()));
completedFlowStream.to(outputTopic);

P.S. Kafka doesn't have an official Python API for Kafka Streams, but there is an open source project: https://github.com/wintoncode/winton-kafka-streams

Nishu Tayal

As of today it is not possible to achieve this at the broker end. There is a Jira feature request open against Apache Kafka to get this feature implemented; you can track it here, and hopefully it will be implemented in the near future: https://issues.apache.org/jira/browse/KAFKA-6020

I feel the best way is to use the RecordFilterStrategy (Java/Spring) interface and filter it at the consumer end.

Dean Jain