
I'm looking at Apache Camel as the best fit for an ETL process which starts with a CSV file.

This file will have millions of rows and an ungodly number of columns (~500).

So far I've looked at a couple of different options: unmarshalling with the CSV data format, and with camel-bindy, but neither quite does what I expect.

The CSV data format parses every row and then passes a list of lists to the next processor, so with millions of rows it blows up with an out-of-memory/heap-space error.
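
Roughly speaking, the route I tried with the CSV data format looked like the sketch below (the file endpoint and bean name are placeholders, not my real configuration):

```java
import org.apache.camel.builder.RouteBuilder;

public class CsvUnmarshalRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:data/inbound?noop=true")
            // unmarshals the entire file into a List<List<String>> in one go,
            // which is what exhausts the heap on a multi-million-row file
            .unmarshal().csv()
            .to("bean:rowHandler");
    }
}
```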

The Bindy approach looked great, until I worked out that I need to map each column in the CSV to the POJO, 99% of which I'm not interested in.
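
For context, a Bindy mapping looks roughly like this, with one annotated field per column; the class and field names here are made up, but it shows why ~500 columns is unworkable:

```java
import org.apache.camel.dataformat.bindy.annotation.CsvRecord;
import org.apache.camel.dataformat.bindy.annotation.DataField;

@CsvRecord(separator = ",")
public class WideRow {
    @DataField(pos = 1)
    private String firstColumn;

    @DataField(pos = 2)
    private String secondColumn;

    // ... and so on for every remaining column, whether you need it or not
}
```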

So the question is: do I need to write an explicit line-by-line processor or component that handles the transform per row and passes it along to the next to() in the route, or is there another option I've not come across yet?

Jason Dwyer
  • I faced the same out-of-memory problem while processing an XLSX file with a large amount of data. I came up with a solution that reuses objects instead of creating a new one for each row being processed. – Lalit Chattar Jan 21 '14 at 23:45

1 Answer


Ah, a very similar question was asked a while ago (I didn't find it on my first search):

Best strategy for processing large CSV files in Apache Camel

The answer is to use the Splitter EIP and stream the output.
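
In route form that looks roughly like the sketch below. The file endpoint, column indexes, and the direct:loadStep endpoint are placeholders, and the bare line.split(",") is a simplification that won't cope with quoted commas:

```java
import org.apache.camel.builder.RouteBuilder;

public class CsvStreamingRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:data/inbound?noop=true")
            // tokenize the body one line at a time; streaming() keeps the splitter
            // from reading the whole file into memory before it starts
            .split(body().tokenize("\n")).streaming()
                .process(exchange -> {
                    // each exchange now carries a single CSV line as a String
                    String line = exchange.getIn().getBody(String.class);
                    String[] columns = line.split(","); // naive split, ignores quoting
                    // keep only the handful of columns of interest (indexes are placeholders)
                    String subset = columns[0] + "," + columns[3] + "," + columns[7];
                    exchange.getIn().setBody(subset);
                })
                .to("direct:loadStep")
            .end();
    }
}
```

Each row travels through the route as its own exchange, so memory use stays flat regardless of file size.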

Jason Dwyer