I'm working on a project that should write data from Kafka to HDFS. Suppose there is an online server that writes messages into Kafka, and each message includes a timestamp. I want to create a job whose output is a file (or files) named according to the timestamps in the messages. For example, if the data in Kafka is
{"ts":"01-07-2013 15:25:35.994", "data": ...}
...
{"ts":"01-07-2013 16:25:35.994", "data": ...}
...
{"ts":"01-07-2013 17:25:35.994", "data": ...}
I would like to get these 3 files as output:
kafka_file_2013-07-01_15.json
kafka_file_2013-07-01_16.json
kafka_file_2013-07-01_17.json
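To make the mapping concrete, here is a minimal sketch of the timestamp-to-filename logic I have in mind. It assumes the `ts` field uses a day-month-year format, as the examples above suggest (`01-07-2013` becomes `2013-07-01`); the function name is mine, not from any library:

```python
from datetime import datetime

def hourly_filename(ts):
    """Map a message timestamp (DD-MM-YYYY HH:MM:SS.mmm, as in the
    example messages) to the hourly output file it belongs in."""
    dt = datetime.strptime(ts, "%d-%m-%Y %H:%M:%S.%f")
    return dt.strftime("kafka_file_%Y-%m-%d_%H.json")

print(hourly_filename("01-07-2013 15:25:35.994"))
# kafka_file_2013-07-01_15.json
```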
And of course, if I run the job again and there are new messages in the queue, like
{"ts":"01-07-2013 17:25:35.994", "data": ...}
it should create the file
kafka_file_2013-07-01_17_2.json // second chunk of hour 17
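The chunk-numbering behavior I want could be sketched like this: given the files already written for an hour, pick the next free `_N` suffix. This is only an illustration of the desired naming scheme (the `existing` set stands in for a real HDFS directory listing, and the function name is hypothetical):

```python
def chunked_filename(base, existing):
    """Pick the next free chunk name for an hourly file: the plain name
    for the first run, then _2, _3, ... suffixes on later runs.
    `existing` stands in for a directory listing (e.g. from HDFS)."""
    if base not in existing:
        return base
    stem, ext = base.rsplit(".", 1)
    n = 2
    while "%s_%d.%s" % (stem, n, ext) in existing:
        n += 1
    return "%s_%d.%s" % (stem, n, ext)
```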
I've seen some open-source projects, but most of them just read from Kafka into a single HDFS folder without partitioning by message timestamp. What is the best solution/design/open-source project for this problem?