I am planning to use the Kafka Connect HDFS connector to move messages from Kafka to HDFS. While looking into it, I see there are parameters such as flush.size and rotate.interval.ms with which you can batch messages on the heap and write the batch at once. Is the batch written to the WAL first and then to the specified location? I also see that it creates a +tmp directory. What is the purpose of the +tmp directory? Couldn't the whole batch be written directly as a file under the specified location, named with its offset range?
-
"From Kafka to Elastic Search" ... Why do you need HDFS Connect for that? – OneCricketeer Feb 01 '19 at 20:01
-
Sorry, it was a typo. So does Kafka Connect work similarly to the Spark receiver-based approach? – Athrey Feb 02 '19 at 21:55
-
Not really. Kafka Connect is its own standalone cluster process. You don't write any code for it – OneCricketeer Feb 04 '19 at 17:11
1 Answer
When the connector's Kafka consumer writes to HDFS, it writes to the WAL first. The +tmp directory holds the temporary files in which records are batched into larger HDFS files; each finished file is then moved to the configured location.
In fact, you can refer to the actual implementation to understand it in depth.
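
To illustrate why a temporary directory helps, here is a minimal sketch of the general "write to a temp file, log the intent, then rename" pattern. It is not the connector's code: it uses the local filesystem as a stand-in for HDFS, and the function name `commit_batch`, the `wal.log` file, the `.txt` extension, and the zero-padded file name layout are all assumptions made for illustration.

```python
import os
import tempfile


def commit_batch(records, topic, partition, start_offset, out_dir):
    """Write a batch under +tmp, log the intended rename, then move it into place."""
    tmp_dir = os.path.join(out_dir, "+tmp")
    os.makedirs(tmp_dir, exist_ok=True)

    # 1. Write the whole batch to a temporary file that readers never look at.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir, suffix="_tmp")
    with os.fdopen(fd, "w") as f:
        for record in records:
            f.write(record + "\n")

    # 2. Name the final file after the offset range it covers (illustrative layout).
    end_offset = start_offset + len(records) - 1
    final_name = f"{topic}+{partition}+{start_offset:010d}+{end_offset:010d}.txt"
    final_path = os.path.join(out_dir, final_name)

    # 3. Record the intended rename in a write-ahead log before performing it,
    #    so a restart after a crash could finish or discard the move.
    with open(os.path.join(out_dir, "wal.log"), "a") as wal:
        wal.write(f"{tmp_path}\t{final_path}\n")
        wal.flush()
        os.fsync(wal.fileno())

    # 4. Rename into place: readers see either no file or the complete file.
    os.replace(tmp_path, final_path)
    return final_path
```

Writing directly to the final location would expose a partially written file to downstream readers if the task crashed mid-batch. With the temp file, anything left under +tmp can simply be ignored or cleaned up on restart.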

Nishu Tayal