
In NiFi, I am listening to a single Kafka topic, and based on routing logic the flow calls the respective process group.

However, if I give the RouteOnContent processor a regular expression to check for the occurrence of a string, will that affect performance? How can I achieve good performance while routing based on a condition?

Bryan Bende
ashok

2 Answers


It would be more efficient to do the split at the KSQL / stream-processing level into different topics, and have NiFi read from those separate topics.

  • Dan's answer gives you several good directions to consider. In general it isn't a great idea for producers to dump data of various formats and schemas onto the same topic. Though I realize sometimes you just have to deal with what you're given. Do you have the ability to influence the producer logic so that data is written to appropriate topics per format/schema or must you just deal with it as is? – Joe Witt Jun 19 '19 at 13:26
  • I think this comment was for the question above? –  Jun 21 '19 at 16:13

Running a regex on the content of each message is an inefficient approach; consider whether you can switch to one of the following:

  • Have your Producers write the necessary metadata into a Kafka header, which lets you use the much more efficient RouteOnAttribute processor in NiFi. This is still message-at-a-time, which has throughput limitations
  • If your messages conform to a schema, use the more efficient KafkaRecord processors in NiFi with a QueryRecord approach, which will significantly boost throughput
  • If you cannot modify the source data and the regex logic is involved, it may be more efficient to use a small Kafka Streams app to split the topic before processing the data further downstream
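The last bullet, splitting the topic before NiFi ever consumes it, comes down to a simple routing decision per message. A minimal sketch of that decision in Python is below; the regex patterns and topic names are hypothetical, and a real implementation would typically live in a Kafka Streams application (Java) that branches one input topic into several output topics. Note the patterns are compiled once up front rather than per message, which is what you would want at high throughput:

```python
import re

# Hypothetical routing rules: each destination topic is matched by a
# precompiled regex. Compiling once avoids re-parsing the pattern on
# every message.
ROUTES = [
    (re.compile(r'"eventType"\s*:\s*"order"'), "orders-topic"),
    (re.compile(r'"eventType"\s*:\s*"payment"'), "payments-topic"),
]
DEFAULT_TOPIC = "unmatched-topic"

def route(message: str) -> str:
    """Return the destination topic for a raw message payload."""
    for pattern, topic in ROUTES:
        if pattern.search(message):
            return topic
    return DEFAULT_TOPIC
```

In a Kafka Streams app the same decision would feed a `split()`/`branch()` call so each matched message is produced to its own topic, letting NiFi consume each topic without any content inspection.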
Chaffelson
  • The data I receive from Kafka can be JSON, text, or even CSV, so I cannot use QueryRecord here. The data must be consumed from a single Kafka topic and afterwards segregated into different topics based on its content. Any other advice you can give would be helpful. – ashok Jun 19 '19 at 12:39
  • If you have no control over what is written to the topic, and different data sets are written to it with no metadata to identify them, then your problem here is your source data - using regex or some code-based approach to identify and separate them is inevitable UNLESS you can get the upstream Producer to write different data to different topics, as it ideally should – Chaffelson Jul 30 '19 at 07:29
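When the formats on one topic really are mixed (JSON, CSV, or plain text, as described in the comments above), a cheap structural sniff of each payload is usually faster than running content regexes. This is a minimal sketch, not a robust format detector, and the topic names in the mapping are hypothetical:

```python
import csv
import io
import json

def classify(message: str) -> str:
    """Guess the format of a raw payload: 'json', 'csv', or 'text'."""
    try:
        json.loads(message)
        return "json"
    except ValueError:
        pass
    # Treat it as CSV only if a comma-delimited row actually parses
    # into more than one field; otherwise fall back to plain text.
    row = next(csv.reader(io.StringIO(message)), [])
    if len(row) > 1:
        return "csv"
    return "text"

# Hypothetical mapping from detected format to a destination topic.
TOPIC_BY_FORMAT = {
    "json": "json-topic",
    "csv": "csv-topic",
    "text": "text-topic",
}
```

The classification result could either drive the split in a small pre-processing app, or be attached as a flow file attribute in NiFi so the cheap RouteOnAttribute processor handles the routing instead of RouteOnContent.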