
The question is pretty simple: I'm looking for a way to safely and optimally load a very large CSV file (> 200 GB) into a Teradata DB. Due to storage restrictions we keep the data file in HDFS, and it needs to be loaded into a Teradata table. Chopping or splitting the CSV into smaller files is possible, but I would consider that a last resort, in which case any option will work.

Possible solutions that were already tried:

 1. Sqoop export: failing due to resource limits despite pushing the maximum number of mappers (a rough sketch of the command is shown after this list).
 2. NiFi flow: GetHDFS > SplitText > SplitText > ... > ConvertCSVToAvro > PutDatabaseRecord.

But the flow seems to hang, which I suspect is due to memory issues.
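
For reference, the Sqoop attempt in (1) looked roughly like the sketch below. The host, credentials, table and paths are made up, and the exact options depend on which Teradata connector/driver is installed, so treat it as an illustration rather than a known-good command:

    # All host names, credentials, paths and table names are placeholders.
    # --batch groups the JDBC inserts; --num-mappers controls parallelism
    # (the part that kept exhausting cluster resources in our runs).
    sqoop export \
      --driver com.teradata.jdbc.TeraDriver \
      --connect "jdbc:teradata://td-prod.example.com/DATABASE=staging_db" \
      --username etl_user -P \
      --table CSV_STAGING \
      --export-dir /data/landing/big_csv/ \
      --input-fields-terminated-by ',' \
      --num-mappers 12 \
      --batch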

Is there some way to split the file into smaller files and insert batches of 250,000 rows into Teradata?
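
To make that concrete, one option would be to stream the file out of HDFS and cut it into 250,000-line pieces without ever landing the full 200 GB as a single local file. The paths below are hypothetical and assume the staging host has enough disk for the chunks (GNU split):

    # Hypothetical paths; assumes enough local disk for the chunk files.
    # Note: if the CSV has a header row, it will end up only in the first chunk.
    hdfs dfs -cat /data/landing/big_csv/big_file.csv \
      | split -l 250000 -d -a 5 --additional-suffix=.csv - /staging/chunks/part_

Each chunk could then be pushed through the existing NiFi flow (or a Teradata load utility) so that no single flowfile or batch has to hold anywhere near 200 GB.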

Any help would be appreciated.

  • At which processor does it hang? Do you have any errors? – daggett Nov 12 '19 at 19:11
  • It would hang in the first processor, GetHDFS. I can see tasks getting queued but don't see the In/Out counters changing. – StrangerThinks Nov 12 '19 at 19:22
  • 1
    If a flow file has not been transferred out of HDFS then it is probably still transferring the file, it has to stream 200GB from HDFS to NiFi's content repository, this could take a long time depending on network speed and disk I/O – Bryan Bende Nov 12 '19 at 19:53
  • @BryanBende Thanks for the input, Bryan! I can wait and see if it processes it. Any idea whether there is a limit on the size of a flowfile, or does it depend entirely on cluster capabilities? – StrangerThinks Nov 12 '19 at 20:15
  • StrangerThinks, I am with Bryan on this issue; I am facing a similar delay (seemingly hanging, but not actually): https://stackoverflow.com/questions/58824311/why-is-executesqlrecord-taking-a-long-time-to-start-outputting-flow-files-on-lar – Vijay Kumar Nov 12 '19 at 20:26
  • 1
    There is not really a limit on NiFi itself, it mostly bound by the network and disk I/O. As a test you could try installing hadoop client on the same NiFi node, and using hdfs command line to transfer the same file from HDFS to NiFi server local disk, most likely it should take almost same amount of time as GetHDFS. – Bryan Bende Nov 12 '19 at 20:30
  • You will probably want to limit the number of follow-on processors, because every one of them has to read 200 GB and write back a new 200 GB. Technically you could go GetHDFS -> PutDatabaseRecord with a CSV Reader, but if you want batches going into PutDatabaseRecord then you are correct about using multiple SplitText processors, and you can probably eliminate ConvertCSVToAvro. – Bryan Bende Nov 12 '19 at 20:32
  • @VijayKumar Hey, I couldn't comment on your question due to reputation, but I am wondering if you could try using Sqoop to import the data? I can confirm it can easily pull 150+ GB of data from an RDBMS. However, if you are looking for a NiFi-only solution then it's a different story. Thanks – StrangerThinks Nov 12 '19 at 20:33
  • @BryanBende I'm trying to approach the flow as per https://community.cloudera.com/t5/Community-Articles/Ingesting-a-Big-CSV-file-into-Kafka-using-a-multi-stages/ta-p/247229. It seems that in order to reduce the Java memory footprint it's better to split the CSV into smaller flowfiles and then batch insert using PutDatabaseRecord. I tested with a 3 GB CSV and it performed better than sending it directly to PutDatabaseRecord. Any thoughts? – StrangerThinks Nov 12 '19 at 20:41

0 Answers