I am trying to process about 7 million rows daily from an Informix table with Apache Camel but I can't figure out how it can be accomplished.
My first attempt that was working with a very low set of data (about 50k rows) was using .split(body()).parallelProcessing()
like so:
from("quartz2://transaccionesGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "queryData").split(body()).parallelProcessing() // Essentially executes a query on my table and returns a list of MyTable.class
.bean(ProcessTable.class, "processData") // Converts each MyTable object into another type of object (NewData.class) for later processing, storing in them in a synchronized list
.end().to("direct:transform-data");
from("direct:transform-data")
.bean(ProcessNewData.class, "processNewData").split(body()).parallelProcessing() // Obtains list
.bean(AnalyzeData.class, "analyze") // Analyzes the data
.bean(PersistData.class, "persist") // Persists the new data on other tables
.end();
This of course resulted in an "OutOfMemory" error when I tried it with 500k rows on .bean(QueryTable.class, "queryData").split(body()).parallelProcessing()
because it first tried caching all of the data from the query before parsing it. I tried setting fetchSize
to something like 100 but I got the same error, and using maxRows
would only get me the amount of rows I specified and ignore the rest.
My next attempt was using one of Camel's components like sql-component and jdbc and trying to use a Splitter to process each row in separate threads but I got the same exact problem.
sql:
from("quartz2://transaccionesGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "queryDataParams") // Gets the params for my query
.to("sql:SELECT * FROM my_table WHERE date_received BETWEEN :#startDate AND :#endDate?dataSource=dataSourceInformix").split(body()).parallelProcessing()
// The rest would be essentially the same
jdbc:
from("quartz2://transaccionesGroup/myTimerTransaction?cron=0+1+0+*+*+?")
.bean(QueryTable.class, "queryString") // Gets the query to execute
.to("jdbc:dataSourceInformix").split(body()).parallelProcessing()
My last attempt was to use maxMessagesPerPoll
for sql and outputType=StreamList
for jdbc components but unfortunatelly the former only processes one row at a time (as well as it has to be a consumer to be used as such) and the latter gives me an java.sql.SQLException: Cursor not open
exception.
sql:
from("sql:" + query +"?dataSource=dataSourceInformix&maxMessagesPerPoll=100") // I need to be able to use the quartz2 component
jdbc:
.to("jdbc:dataSourceInformix?outputType=StreamList").split(body()).streaming() // Throws exception
The end goal is to be able to process millions of rows without consuming so much memory so as to prevent the "OutOfMemory" error. My idea, if possible is to do the following:
- Create my query on quartz cron-trigger
- Obtain and group N amount of results
- Send a group of results to be process (in another thread) whilst another group is being obtained
- Repeat untill all data has been processed
I know this question is similar to this one but the answer doesn't really help my situation. I also noticed that in the documentation for the sql component it has an outputType=StreamList
option for the producer but it is implemented on version 2.18 and higher while I have version 2.14.1.
Any help and tips would be extremely helpful!
Thanks.
Some other info: Apache Camel Version: 2.14.1 Database: Informix