
I have two Kafka streams, request and event, each partitioned on a common field, requestId (its last two digits). I want to join both streams and write the result to HDFS or the local filesystem. How do I write an efficient consumer that considers only the relevant partitions while joining the two streams?

Rubbal

2 Answers


You should use Kafka's Streams API, Apache Kafka's stream processing library, instead of a hand-written consumer. To write the data to HDFS, you should use Kafka Connect.

For doing the join, look at this question: How to manage Kafka KStream to Kstream windowed join?
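For illustration, here is a minimal sketch of such a windowed KStream-KStream join using the Kafka Streams DSL. The topic names (`requests`, `events`, `request-event-joined`), the String serdes, and the 5-minute window are assumptions, not anything prescribed by the question:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class RequestEventJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Application id and broker address are placeholders.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "request-event-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Both topics are assumed to be keyed by requestId, so they are
        // co-partitioned and the join happens partition-locally.
        KStream<String, String> requests =
                builder.stream("requests", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Windowed KStream-KStream join: pairs records with the same key
        // whose timestamps lie within 5 minutes of each other.
        KStream<String, String> joined = requests.join(
                events,
                (requestValue, eventValue) -> requestValue + "," + eventValue,
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        // Write the joined records to an output topic; Kafka Connect can
        // then sink that topic to HDFS.
        joined.to("request-event-joined", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because both input topics are keyed by requestId with the same number of partitions, Kafka Streams processes each partition pair locally; there is no need to hand-pick partitions in a custom consumer.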

Also check out Confluent's documentation about Kafka Streams and Kafka Connect to get started. If you have further questions, please ask a follow-up question (after reading the manual :)).

Matthias J. Sax
  • Thanks. Looks like exactly the kind of thing I need! – Rubbal Jan 19 '17 at 02:24
  • @matthias-j-sax I read the manual and both these libraries are compatible only with Confluent's version (thanks to rationalSring for pointing it out). Are there any downsides to using Confluent's version? – Rubbal Jan 20 '17 at 09:56
  • 1
    That's not true. Confluent, just repackages Apache Kafka, and it's 100% compatible with ASF version. Only the HDFS connector is not part of Apache Kafka, but you can download it from confluent.io/product/connectors and use with ASF version, too. As Confluent offers Confluent Open Source -- and the code is 100% compatible with ASF Kafka -- there are no disadvantages using Confluent's offer -- only advantages as you get a larger software stack. – Matthias J. Sax Jan 20 '17 at 17:10
  • That sounds great! Thanks for all the help – Rubbal Jan 21 '17 at 01:20
  • 2
    Just to clarify. *Disclaimer: I am an employee at Confluent.* – Matthias J. Sax Jan 22 '17 at 03:53
  • @MatthiasJ.Sax Any suggestions for when topics cannot be co-partitioned? Thanks! – Vikas Tikoo Jun 09 '17 at 16:22
  • To co-partition topics, both must have the same number of partitions. If that's the case, you just use the same key for both topics to co-partition them (see the sketch below). Hope this helps. – Matthias J. Sax Jun 09 '17 at 16:37
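If the topics are not already co-partitioned, the Streams DSL can re-key and repartition one side before the join. Below is a minimal sketch of that, assuming the requestId can be extracted from the record value and that the other topic has 100 partitions (one per two-digit suffix, matching the question); the helper extractRequestId is hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

public class CoPartitionExample {
    static KStream<String, String> coPartition(StreamsBuilder builder) {
        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Re-key on requestId and force a repartition to the same partition
        // count as the other topic, so a downstream join sees co-partitioned
        // input.
        return events
                .selectKey((oldKey, value) -> extractRequestId(value))
                .repartition(Repartitioned.<String, String>with(Serdes.String(), Serdes.String())
                        .withNumberOfPartitions(100));
    }

    // Hypothetical helper: pull the requestId out of the record value.
    private static String extractRequestId(String value) {
        return value.split(",")[0];
    }
}
```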

Kafka Streams with Kafka Connect (for HDFS) is a straightforward solution. However, it must be pointed out that the HDFS connector for Kafka Connect is only available with Confluent's distribution of Kafka. Apache Kafka's Connect ships only with a file writer, not an HDFS writer.
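As a concrete starting point, here is a minimal sketch of a sink configuration for Confluent's HDFS connector (run with a standalone Connect worker); the topic name and HDFS URL are placeholders:

```properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Topic produced by the Streams join above (name is an assumption)
topics=request-event-joined
# HDFS namenode URL for the target cluster (placeholder)
hdfs.url=hdfs://localhost:9000
# Number of records to accumulate before committing a file to HDFS
flush.size=1000
```

See the connector's documentation for output format and partitioner options.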

Basanth Roy
  • How about using Flume to write to HDFS from Kafka? – Rubbal Jan 20 '17 at 09:53
  • 1
    That is no completely correct: Even if you use vanilla Apache Kafka, you can just download Confluent's HDFS connector https://www.confluent.io/product/connectors/ and use it. Furthermore, there is no "Confluent Version of Kafka" -- it's just repackage but 100% compatible with Apache Kafka (it might contain additional bug fixed -- but this happens rarely). – Matthias J. Sax Jan 20 '17 at 17:06
  • @Rubbal, I have not used Flume for that specific purpose. – Basanth Roy Jan 20 '17 at 19:13
  • @Rubbal You will not be able to write to HDFS with Apache Kafka Connect alone. That is all I wanted to point out. But thanks for clarifying that Confluent is 100% compatible with Apache Kafka. And since Confluent has open-sourced the code, it should be straightforward, I guess. My team is also currently evaluating whether to go with ASF Kafka or Confluent. – Basanth Roy Jan 20 '17 at 19:17