19

It was said that consumers can only read the whole topic; there is no way to evaluate and filter messages on the brokers.

This implies that we have to consume/receive all messages from a topic and filter them on the client side.

That's too much. I was wondering whether we can filter and receive only specific types of messages, based on something already passed to the brokers, such as the message keys or other attributes.

Looking at the method Consumer.poll(timeout), it seems there is nothing extra we can do.

mikej1688
  • 193
  • 1
  • 2
  • 10
  • Your topic should only carry one type of message. The point is, if you do your filtering while inserting data into separate topics, then consuming from each topic should be pretty straightforward – Infamous Jun 26 '18 at 19:59
  • If the number of consumers were large, say a million, and Kafka were used as a pipeline for communication among the consumers, creating a specific topic for each consumer would not be a solution, right? – mikej1688 Jun 27 '18 at 15:54
  • No. My understanding of consumers is that consumers are dumb, meaning they just listen to a particular topic and process whatever comes out of the Kafka queue. The moment you start using Kafka as a "communication pipeline" between these so-called "consumers" you are running into trouble. What you need is an ESB if you want to manage communication between several million components. – Infamous Jun 27 '18 at 18:23
  • Sorry, what's an ESB? – mikej1688 Jun 28 '18 at 03:03
  • From my perspective, the original question has already been answered in the responses below. Please mark the correct answer or provide feedback on why it has not been answered. Thanks! :) – dbustosp Jun 28 '18 at 04:04

4 Answers

8

No, with the Consumer API you cannot receive only some of the messages from a topic. The consumer fetches all messages in order.

If you don't want to filter messages in the consumer, you could use a Kafka Streams job. For example, the Streams job would read from your topic and push only the messages the consumer is interested in to another topic. The consumer can then subscribe to this new topic.
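
For illustration, here's a minimal sketch of such a Streams job (the topic names fruits and apples-only, and the idea of filtering on the record key, are assumptions made up for this example):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class FruitFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fruit-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("fruits", Consumed.with(Serdes.String(), Serdes.String()))
               // keep only the records this consumer is interested in
               .filter((key, value) -> "apple".equals(key))
               .to("apples-only", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}

The downstream consumer then subscribes to apples-only instead of fruits.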

Mickael Maison
  • 25,067
  • 7
  • 71
  • 68
  • 3
    From my understanding, even with Streams you will still bring the data to the client side. – dbustosp Jun 26 '18 at 20:27
  • Not sure what your pipeline looks like, but in some cases you can run Streams apps "close" to the cluster, where it's relatively cheap to consume and write back to Kafka. – Mickael Maison Jun 26 '18 at 21:10
  • Not sure how this can be supported; I would appreciate it if you could give me an example :) – dbustosp Jun 26 '18 at 21:15
  • Thanks. Kafka supports a pub/sub model. My original goal was to take advantage of its high throughput while using it as a pipeline for communication among the consumers (i.e., clients). It might not be the right way to use Kafka. – mikej1688 Jun 28 '18 at 03:12
  • https://issues.apache.org/jira/browse/KAFKA-6020 – Nikolay Dimitrov Apr 26 '23 at 08:35
4

Each Kafka topic should contain messages that are logically similar, just to stay on topic. Now, it might happen that you have a topic, let's say fruits, whose messages carry different attributes of a fruit (maybe in JSON format). Producers may push messages for different fruits, but one of your consumer groups wants to process only apples. Ideally you would have gone with a topic per fruit, but let's assume that is a fruitless endeavor for some reason (maybe there would be too many topics). In that case, have the producer put the fruit name in the message key, and override Kafka's default partitioning scheme so that the key is ignored and records are partitioned randomly, by passing a custom partitioner class through the producer's partitioner.class property (see the sketch below). Overriding the partitioner is necessary because, by default, every message with the same key goes to the same partition, and that might cause partition imbalance.
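
A minimal sketch of such a partitioner (the class name and the random strategy are just one way to do it):

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class RandomPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // ignore the key entirely and spread records evenly at random,
        // so keying by fruit name doesn't pile everything onto one partition
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}

The producer would then set partitioner.class to this class and still put the fruit name in the record key, e.g. new ProducerRecord<>("fruits", "apple", appleJson).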

The idea behind this is that if your Kafka message value is a complex object (JSON, an Avro record, etc.), it might be quicker to filter records based on the key than to parse the whole value and extract the desired field. I don't have any data to support the performance benefit of this approach, though; it's only an intuition.
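
On the consumer side, that key-based filtering could look roughly like this (the topic name, group id and the "apple" key are assumptions for the example):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AppleOnlyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "apple-consumers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("fruits"));
            while (true) {
                for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    if (!"apple".equals(record.key())) {
                        continue; // cheap check on the key; the value is never parsed
                    }
                    // parse record.value() (JSON/Avro) only for the records we want
                }
            }
        }
    }
}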

Bitswazsky
  • 4,242
  • 3
  • 29
  • 58
2

Once records have been pushed into the Kafka cluster, there is not much you can do. Whatever you want to filter, you will always have to bring the data to the client side.

Unfortunately, the only option is to move that logic into the producers; that way you can push the data into multiple topics based on whatever logic you define.
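
As a rough sketch of that producer-side routing (the per-type topic naming and the fruit example are assumptions for illustration):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RoutingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String fruitName = "apple"; // field extracted from the message (assumed)
            String json = "{\"name\":\"apple\",\"color\":\"red\"}";
            // route each record to a per-type topic so that consumers
            // subscribe only to the type of message they care about
            producer.send(new ProducerRecord<>("fruit-" + fruitName, fruitName, json));
        }
    }
}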

dbustosp
  • 4,208
  • 25
  • 46
  • Agreed. Once subscribed to the topic, the consumer accepts whatever is in the topic, and filtering has to be done on the consumer side. The drawback is wasted network bandwidth. – mikej1688 Jun 28 '18 at 03:22
  • @mikej1688 there is nothing you can do on the consumer side. There are other strategies you can follow in order to save network bandwidth. – dbustosp Jun 28 '18 at 04:02
  • It seems like the Solace messaging platform might do more custom stuff. Here is their slogan: "Message Routing, Filtering & Ordering" – mikej1688 Jun 28 '18 at 21:59
0

A Kafka consumer will receive all messages from the topic. But if only a custom message type (MyMessage) needs to be consumed, it can be filtered in the Deserializer class. If the consumer receives two types of messages, say String and MyMessage, the String messages are ignored (they deserialize to null) and only the MyMessage messages are processed.

import java.nio.charset.StandardCharsets;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyMessageDeserializer implements Deserializer<MyMessage> {

    private static final Logger logger = LoggerFactory.getLogger(MyMessageDeserializer.class);
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public MyMessage deserialize(String topic, byte[] data) {
        if (data == null) {
            logger.info("Null received at deserializing");
            return null;
        }
        try {
            // only payloads that parse as MyMessage are returned; anything else
            // (e.g. plain String messages) comes back as null and can be skipped
            return objectMapper.readValue(new String(data, StandardCharsets.UTF_8), MyMessage.class);
        } catch (Exception e) {
            logger.error("Deserialization exception: " + e.getMessage());
            return null;
        }
    }
}
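
To make use of this, the consumer registers the deserializer and skips the null values it returns for unrecognized payloads. A rough sketch, assuming props already holds the bootstrap servers and group id, and an illustrative topic name:

props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, MyMessageDeserializer.class.getName());

try (KafkaConsumer<String, MyMessage> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));
    while (true) {
        for (ConsumerRecord<String, MyMessage> record : consumer.poll(Duration.ofMillis(500))) {
            if (record.value() == null) {
                continue; // payload was not a MyMessage; skip it
            }
            // process record.value()
        }
    }
}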
Procrastinator
  • 2,526
  • 30
  • 27
  • 36