
I need to get data from CSV files (daily extractions from different business databases) into HDFS, then move it to HBase, and finally load aggregations of this data into a data mart (SQL Server).

I would like to know the best way to automate this process (using Java or Hadoop tools).


2 Answers


I'd echo the comment above regarding Kafka Connect, which is part of Apache Kafka. With it you just use configuration files to stream from your sources, you can use KSQL to create derived/enriched/aggregated streams, and then stream these to HDFS/Elastic/HBase/JDBC/etc.

There's a list of Kafka Connect connectors here.

This blog series walks through the basics.
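
To make the configuration-only approach concrete, here's a rough sketch of a standalone Kafka Connect setup; the connector names, file path, topic, and HDFS URL are placeholders, and the sink shown is Confluent's kafka-connect-hdfs connector, which is a separate install:

```
# source.properties – tail a daily CSV extract into a Kafka topic
name=csv-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/data/extracts/orders.csv
topic=orders_raw

# sink.properties – write the topic out to HDFS
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=orders_raw
hdfs.url=hdfs://namenode:8020
flush.size=1000
```

Note that FileStreamSource is the demo connector that ships with Kafka; for real CSV ingestion you would more likely pick a CSV-aware source connector from the list linked above, but the workflow is the same: write a config file, no code.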

Robin Moffatt
  • Thx Robin, I started reading about Kafka connectors and it seems easier and less expensive. I'll PoC this part and see what happens. Thx again, and nice blog btw ;) – rnside Sep 14 '17 at 13:32

Little to no coding required? In no particular order:

  • Talend Open Studio
  • StreamSets Data Collector
  • Apache NiFi

Assuming you can set up a Kafka cluster, you can try Kafka Connect

If you want to program something, Spark is probably the best fit; otherwise, pick your favorite language. Either way, schedule the job via Oozie (a minimal Spark sketch follows).
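
As a rough illustration of the Spark route, here is a minimal Java sketch of the first leg (CSV into HDFS). The paths, app name, and schema options are assumptions, not from the answer; the packaged jar is what an Oozie workflow would schedule:

```java
// Minimal sketch: read the daily CSV extracts and land them in HDFS as
// Parquet. All paths and options here are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DailyCsvIngest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-csv-ingest")
                .getOrCreate();

        // Hypothetical landing directory where the extracts are dropped
        Dataset<Row> extracts = spark.read()
                .option("header", "true")        // first row holds column names
                .option("inferSchema", "true")   // let Spark guess column types
                .csv("hdfs://namenode:8020/landing/extracts/*.csv");

        // Keep the raw data in HDFS so other consumers can read it
        extracts.write()
                .mode(SaveMode.Append)
                .parquet("hdfs://namenode:8020/warehouse/raw");

        spark.stop();
    }
}
```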

If you don't need the raw HDFS data, you can load directly into HBase (see the sketch below).
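
For the direct-to-HBase route, a bare-bones loader using the standard HBase Java client could look like the following; the table name, column family, and CSV column layout are all placeholders:

```java
// Minimal sketch, assuming an existing 'orders' table with a 'd' column
// family and a CSV layout of id,customer,amount. All names are hypothetical.
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CsvToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("orders"));
             BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {

            String line = reader.readLine(); // skip the header row
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                Put put = new Put(Bytes.toBytes(cols[0])); // row key = id
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("customer"),
                        Bytes.toBytes(cols[1]));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"),
                        Bytes.toBytes(cols[2]));
                table.put(put);
            }
        }
    }
}
```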

OneCricketeer
  • Thx @cricket_007 for the answer. Could you tell me please what you mean by "why you can't land directly into HBase"? – rnside Sep 13 '17 at 09:54
  • Write some code to parse a CSV and write into an HBase table. Since HBase already sits on top of HDFS, I don't see the need to put the data into HDFS and then load it into HBase – OneCricketeer Sep 13 '17 at 10:10
  • I understand. Actually, there will be other applications that need the same data, and they want to extract it directly from HDFS. Only mine needs HBase. – rnside Sep 13 '17 at 12:22