
I have two Kafka streams, request and event, each partitioned on a common field, requestId (its last two digits). I want to join both streams and write the result to HDFS or the local filesystem. How do I write an efficient consumer that considers only the relevant partitions while joining the two streams?

Rubbal

2 Answers


You should use Kafka's Streams API, Apache Kafka's stream processing library, instead of a hand-written consumer. To write the data to HDFS, you should use Kafka Connect.

For doing the join, look at this question: How to manage Kafka KStream to Kstream windowed join?
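For illustration, here is a minimal sketch of such a windowed KStream-KStream join using the Kafka Streams DSL. The topic names (`requests`, `events`, `request-event-joined`), the String serdes, and the 5-minute window are assumptions, not anything prescribed by the question:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class RequestEventJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Application id and broker address are placeholders.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "request-event-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Both topics are assumed to be keyed by requestId, so they are
        // co-partitioned and the join happens partition-locally.
        KStream<String, String> requests =
                builder.stream("requests", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Windowed KStream-KStream join: pairs records with the same key
        // whose timestamps lie within 5 minutes of each other.
        KStream<String, String> joined = requests.join(
                events,
                (requestValue, eventValue) -> requestValue + "," + eventValue,
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        // Write the joined records to an output topic; Kafka Connect can
        // then sink that topic to HDFS.
        joined.to("request-event-joined", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because both input topics are keyed by requestId with the same number of partitions, Kafka Streams processes each partition pair locally; there is no need to hand-pick partitions in a custom consumer.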

Also check out Confluent's documentation about Kafka Streams and Kafka Connect to get started. If you have further questions, please ask a follow-up question (after reading the manual :)).

Matthias J. Sax
  • Thanks. Looks like exactly the kind of thing I need! – Rubbal Jan 19 '17 at 02:24
  • @matthias-j-sax I read the manual and both these libraries are compatible only with Confluent's version (thanks to rationalSring for pointing it out). Are there any downsides to using Confluent's version? – Rubbal Jan 20 '17 at 09:56
  • 1
    That's not true. Confluent, just repackages Apache Kafka, and it's 100% compatible with ASF version. Only the HDFS connector is not part of Apache Kafka, but you can download it from confluent.io/product/connectors and use with ASF version, too. As Confluent offers Confluent Open Source -- and the code is 100% compatible with ASF Kafka -- there are no disadvantages using Confluent's offer -- only advantages as you get a larger software stack. – Matthias J. Sax Jan 20 '17 at 17:10
  • That sounds great! Thanks for all the help – Rubbal Jan 21 '17 at 01:20
  • 2
    Just to clarify. *Disclaimer: I am an employee at Confluent.* – Matthias J. Sax Jan 22 '17 at 03:53
  • @MatthiasJ.Sax Any suggestions for when topics cannot be co-partitioned? Thanks! – Vikas Tikoo Jun 09 '17 at 16:22
  • To co-partition topics, both must have the same number of partitions. If that's the case, you just use the same key for both topics to co-partition them (see the sketch below). Hope this helps. – Matthias J. Sax Jun 09 '17 at 16:37
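If the topics are not already co-partitioned, the Streams DSL can re-key and repartition one side before the join. Below is a minimal sketch of that, assuming the requestId can be extracted from the record value and that the other topic has 100 partitions (one per two-digit suffix, matching the question); the helper extractRequestId is hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

public class CoPartitionExample {
    static KStream<String, String> coPartition(StreamsBuilder builder) {
        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Re-key on requestId and force a repartition to the same partition
        // count as the other topic, so a downstream join sees co-partitioned
        // input.
        return events
                .selectKey((oldKey, value) -> extractRequestId(value))
                .repartition(Repartitioned.<String, String>with(Serdes.String(), Serdes.String())
                        .withNumberOfPartitions(100));
    }

    // Hypothetical helper: pull the requestId out of the record value.
    private static String extractRequestId(String value) {
        return value.split(",")[0];
    }
}
```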

Kafka Streams with Kafka Connect (for HDFS) is a straightforward solution. However, it must be pointed out that the HDFS connector for Kafka Connect is only available with Confluent's distribution of Kafka. Apache Kafka's Connect ships only with a file writer, not an HDFS writer.
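As a concrete starting point, here is a minimal sketch of a sink configuration for Confluent's HDFS connector (run with a standalone Connect worker); the topic name and HDFS URL are placeholders:

```properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Topic produced by the Streams join above (name is an assumption)
topics=request-event-joined
# HDFS namenode URL for the target cluster (placeholder)
hdfs.url=hdfs://localhost:9000
# Number of records to accumulate before committing a file to HDFS
flush.size=1000
```

See the connector's documentation for output format and partitioner options.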

Basanth Roy
  • How about using Flume to write to HDFS from Kafka? – Rubbal Jan 20 '17 at 09:53
  • 1
    That is no completely correct: Even if you use vanilla Apache Kafka, you can just download Confluent's HDFS connector https://www.confluent.io/product/connectors/ and use it. Furthermore, there is no "Confluent Version of Kafka" -- it's just repackage but 100% compatible with Apache Kafka (it might contain additional bug fixed -- but this happens rarely). – Matthias J. Sax Jan 20 '17 at 17:06
  • @Rubbal, I have not used Flume for that specific purpose. – Basanth Roy Jan 20 '17 at 19:13
  • @Rubbal You will not be able to write to HDFS with Apache Kafka Connect alone. That is all I wanted to point out. But thanks for clarifying that Confluent is 100% compatible with Apache Kafka. And since Confluent has open-sourced the code, it should be straightforward, I guess. My team is also currently evaluating whether to go with ASF Kafka or Confluent. – Basanth Roy Jan 20 '17 at 19:17