
I am using Apache Camel to generate monthly reports. I have a MySQL query that, when run against my DB, returns around 5 million records (20 columns each). The query itself takes approximately 70 minutes to execute.

To speed this up, I created 5 seda (worker) routes and used multicast().parallelProcessing() so that the workers query the DB in parallel for different time ranges, and then merged the results with an aggregator.

Now I can see the 5 million records in my exchange body (as a List<HashMap<String, Object>>). When I try to format this with Camel Bindy to generate a CSV file from the data, I get a GC overhead limit exceeded error. I tried increasing the Java heap size, but then the transformation takes forever.

Is there another way to convert this raw data into a well-formatted CSV file? Can Java 8 streams be useful?
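
For example, I was wondering whether a stream-based mapping along the lines of the sketch below would behave any better, or whether collect() still materialises the full 5-million-entry list in memory (toLogEntry here is a hypothetical helper that would hold the same per-row logic as buildLogReport further down):

// Rough sketch only, not my actual code
List<MyLogEntry> reportList = resultEntries.stream()
        .map(this::toLogEntry)                  // hypothetical per-row mapping
        .collect(Collectors.toList());          // java.util.stream.Collectors
exchange.getIn().setBody(reportList);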

Code

from("direct://logs/testLogs")
    .routeId("Test_Logs_Route")
    .setProperty("Report", simple("TestLogs-${date:now:yyyyMMddHHmm}"))
    .bean(Logs.class, "buildLogsQuery")                             // bean that generates the logs query
    .multicast()
    .parallelProcessing()
    .to("seda:worker1?waitForTaskToComplete=Always&timeout=0",      // worker routes
        "seda:worker2?waitForTaskToComplete=Always&timeout=0",
        "seda:worker3?waitForTaskToComplete=Always&timeout=0",
        "seda:worker4?waitForTaskToComplete=Always&timeout=0",
        "seda:worker5?waitForTaskToComplete=Always&timeout=0");

All my worker routes look like this

from("seda:worker4?waitForTaskToComplete=Always")
    .routeId("ParallelProcessingWorker4")
    .log(LoggingLevel.INFO, "Parallel Processing Worker 4 Flow Started")
    .setHeader("WorkerId", constant(4))
    .bean(Logs.class, "testBean")                                   // appends time-clause to the query based in WorkerID
    .to("jdbc:oss-ro-ds")
    .to("seda:resultAggregator?waitForTaskToComplete=Always&timeout=0");

Aggregator

from("seda:resultAggregator?waitForTaskToComplete=Always&timeout=0")
    .routeId("Aggregator_ParallelProcessing")
    .log(LoggingLevel.INFO, "Aggregation triggered for processor ${header.WorkerId}")
    .aggregate(header("Report"), new ParallelProcessingAggregationStrategy())
    .completionSize(5)
    .to("direct://logs/processResultSet")


from("direct://logs/processResultSet")
    .routeId("Process_Result_Set")
    .bean(Test.class, "buildLogReport");
    .marshal(myLogBindy)
    .to("direct://deliver/ooma");

Method buildLogReport

public void buildLogReport(List<HashMap<String, Object>> resultEntries, Exchange exchange) throws Exception {
        Map<String, Object> headerMap = exchange.getIn().getHeaders();
        ArrayList<MyLogEntry> reportList = new ArrayList<>();

        while (!resultEntries.isEmpty()) {
            HashMap<String, Object> resultEntry = resultEntries.get(0);
            MyLogEntry logEntry = new MyLogEntry();

            logEntry.setA((String) resultEntry.get("A"));
            logEntry.setB((String) resultEntry.get("B"));
            logEntry.setC(((BigDecimal) resultEntry.get("C")).toString());
            if (null != resultEntry.get("D"))
                logEntry.setD(((BigInteger) resultEntry.get("D")).toString());
            logEntry.setE((String) resultEntry.get("E"));
            logEntry.setF((String) resultEntry.get("F"));
            logEntry.setG(((BigDecimal) resultEntry.get("G")).toString());
            logEntry.setH((String) resultEntry.get("H"));
            logEntry.setI(((Long) resultEntry.get("I")).toString());
            logEntry.setJ((String) resultEntry.get("J"));
            logEntry.setK(TimeUtils.convertDBToTZ((Date) resultEntry.get("K"), (String) headerMap.get("TZ")));
            logEntry.setL(((BigDecimal) resultEntry.get("L")).toString());
            logEntry.setM((String) resultEntry.get("M"));
            logEntry.setN((String) resultEntry.get("State"));
            logEntry.setO((String) resultEntry.get("Zip"));
            logEntry.setP("\"" + (String) resultEntry.get("Type") + "\"");
            logEntry.setQ((String) resultEntry.get("Gate"));

            reportList.add(logEntry);
            resultEntries.remove(resultEntry);
        }

        // Transform The Exchange Message
        exchange.getIn().setBody(reportList);
    }
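
Alternatively, would it be better not to build the full list at all and instead stream the rows out, appending them to the CSV file as they come? Something along these lines is what I have in mind (rough sketch only; the endpoint names and the buildLogReportRow bean method are made up, and I am not sure whether Bindy is happy marshalling one record at a time):

from("direct://logs/processResultSetStreaming")
    .routeId("Process_Result_Set_Streaming")
    .split(body()).streaming()                      // iterate the List<HashMap> row by row
        .bean(Test.class, "buildLogReportRow")      // hypothetical: maps a single row to one MyLogEntry
        .marshal(myLogBindy)                        // one CSV record per exchange
        .to("file://reports?fileName=${exchangeProperty.Report}.csv&fileExist=Append")
    .end();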
  • It's a bit unclear what you are asking; however, I would expect a GC overhead exception on a 5-million-row CSV file. I would just write smaller CSV files to a directory and, once done, call a route which reads the files and appends them. 5 million rows in a CSV file is really going to be a pain. – Namphibian Aug 01 '17 at 02:17
  • And streams too. Do not try this in memory.... – Namphibian Aug 01 '17 at 02:19
  • Check this link: https://stackoverflow.com/questions/8122748/best-strategy-for-processing-large-csv-files-in-apache-camel – mgyongyosi Aug 01 '17 at 08:38
