many queries in a task to generate json

Question

So I've got a task to build which is going to archive a ton of data in our DB into JSON.

To give you a better idea of what is happening: X has 100s of Ys, each Y has 100s of Zs, and so on. I'm creating a JSON file for every X, Y, and Z. Every X JSON file has an array of ids for the child Ys of X, and likewise each Y file stores an array of ids for its child Zs.

It's more complicated than that in many cases, but I think that example should give you an idea of the complexity involved.

I was using ColdFusion, but it seems to be a bad choice for this task because it is crashing due to memory errors. It seems to me that if it were removing queries from memory once they are no longer referenced while the task runs (i.e., garbage collecting), the task should have enough memory. But afaict ColdFusion isn't doing any garbage collection during the request at all, and must be doing it only after the request is complete.

So I'm looking either for advice on how to better achieve my task in CF, or for recommendations on other languages to use.

Thanks.

It's possible your code is not as efficient as it could be. Maybe you should post a (simplified) sample? — Adam Tuttle, Mar 05 '11 at 03:28
@Adam Tuttle: while possible code could be more efficient, it's likely the sheer number that's causing the issue, see my answer . — orangepips, Mar 05 '11 at 10:03

Mark · Answer 1 · 2011-03-07T20:13:12.413

1) If you have debugging enabled, ColdFusion will hold on to your queries until the page is done. Turn it off!

2) You may need to structDelete() the query variable to allow it to be garbage collected; otherwise it may persist as long as the scope that holds a reference to it exists, e.g., <cfset structDelete(variables,'myQuery') />

3) A cfquery pulls the entire ResultSet into memory. Most of the time this is fine. But for reporting on a large result set, you don't want this. Some JDBC drivers support setting the fetchSize, which in a forward, read only fashion, will let you get a few results at a time. This way you can deal with thousands and thousands of rows, without swamping memory. I just generated a 1GB csv file in ~80 seconds, using less than 100mb of heap. This requires dropping out to Java. But it kills two birds with one stone. It reduces the amount of data brought in at a time by the JDBC driver, and since you're working directly with the ResultSet, you don't hit the cfloop problem @orangepips mentioned. Granted, it's not for those without some Java chops.

You can do it something like this (you need cfusion.jar in your build path):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import javax.sql.DataSource;

import au.com.bytecode.opencsv.CSVWriter;
import coldfusion.server.ServiceFactory;

public class CSVExport {
    public static void export(String dsn,String query,String fileName) {
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;
        FileWriter fw = null;
        BufferedWriter bw = null;


        try {
            DataSource ds = ServiceFactory.getDataSourceService().getDatasource(dsn);
            conn = ds.getConnection();
            // we want a forward-only, read-only result.
            // you may want to use a PreparedStatement instead.
            stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY
            );
            // we only want to go forward!
            stmt.setFetchDirection(ResultSet.FETCH_FORWARD);
            // how many records to pull back at a time.
            // the hard part is balancing memory usage, and round trips to the database.
            // basically sacrificing speed for a lower memory hit.
            stmt.setFetchSize(256);
            rs = stmt.executeQuery(query);
            // do something with the ResultSet, for example write to csv using opencsv
            // the key is to stream it. you don't want it stored in memory.
            // so excel spreadsheets and pdf files are out, but text formats
            // like csv, json, html, and some binary formats like MDB (via jackcess)
            // that support streaming are in.
            fw = new FileWriter(fileName);
            bw = new BufferedWriter(fw);
            CSVWriter writer = new CSVWriter(bw);
            writer.writeAll(rs,true);
        }
        catch (Exception e) {
            // handle your exception.
            // maybe try ServiceFactory.getLoggingService() if you want to do a cflog.
            e.printStackTrace();
        }
        finally {
            // close everything; a null reference just throws and is ignored here.
            try {rs.close();} catch (Exception e) {}
            try {stmt.close();} catch (Exception e) {}
            try {conn.close();} catch (Exception e) {}
            try {bw.close();} catch (Exception e) {}
            try {fw.close();} catch (Exception e) {}
        }
    }
}

Figuring out how to pass parameters, logging, turning this into a background process (hint: extend Thread) etc. are separate issues, but if you grok this code, it shouldn't be too difficult.

4) Perhaps look at Jackson for generating your json. It supports streaming, and combined with the fetchSize, and a BufferedOutputStream, you should be able to keep the memory usage way down.
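
To make the streaming idea concrete, here is a minimal sketch of writing one parent document with its array of child ids, one id at a time. It is an illustration only: ParentJsonWriter and writeParent are hypothetical names, the Iterator stands in for a forward-only ResultSet, and plain java.io stands in for Jackson's streaming generator so the sketch has no dependencies. It also assumes the ids are numeric, so no string escaping is needed.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper for illustration; not part of the original answer.
class ParentJsonWriter {
    // Streams {"id":<parentId>,"children":[...]} to out, one child id at a time,
    // so the full list of child ids is never built up in memory by this method.
    static void writeParent(Writer out, long parentId, Iterator<Long> childIds)
            throws IOException {
        out.write("{\"id\":");
        out.write(Long.toString(parentId));
        out.write(",\"children\":[");
        boolean first = true;
        while (childIds.hasNext()) {
            if (!first) {
                out.write(',');
            }
            out.write(Long.toString(childIds.next()));
            first = false;
        }
        out.write("]}");
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter keeps the demo self-contained; in practice this would
        // be a BufferedWriter over a file.
        StringWriter sw = new StringWriter();
        writeParent(sw, 7L, Arrays.asList(1L, 2L, 3L).iterator());
        System.out.println(sw); // prints {"id":7,"children":[1,2,3]}
    }
}
```

In a real task the iterator would be replaced by a loop over rs.next(), and a proper JSON library would handle escaping of string values.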

ya I noticed more memory was being used with debugging on, it's off. — erikvold, Mar 05 '11 at 19:08
I do call struct delete, like I said there are no references to these queries that should be gc — erikvold, Mar 05 '11 at 19:08

score 4 · Accepted Answer · edited May 23 '17 at 12:26

Erik, you are absolutely correct that ColdFusion garbage collection does not remove query information from memory until the request ends, and I've documented it fairly extensively in another SO question. In short, you hit OoM exceptions when you loop over queries. You can prove it by using a tool like VisualVM to generate a heap dump while the process is running, and then running the resulting dump through Eclipse Memory Analyzer Tool (MAT). What MAT would show you is a large hierarchy, starting with an object named (I'm not making this up) CFDummyContent, that holds, among other things, references to cfquery and cfqueryparam tags. Note that switching to stored procs, or even doing the database interaction via JDBC, does not make a difference.

So. What. To. Do?

This took me a while to figure out, but you've got 3 options in increasing order of complexity:

1) <cfthread/>
2) asynchronous CFML gateway
3) daisy chain http requests

Using cfthread looks like this:

<cfloop ...>
    <cfset threadName = "thread" & createUuid()>
    <cfthread name="#threadName#" input="#value#">
        <!--- do query stuff --->
        <!--- code has access to passed attributes (e.g. #attributes.input#) --->
        <cfset thread.passOutOfThread = somethingGeneratedInTheThread>
    </cfthread>
    <cfthread action="join" name="#threadName#">
    <cfset passedOutOfThread = cfthread["#threadName#"].passOutOfThread>
</cfloop>

Note that this code does not take advantage of asynchronous processing, hence the immediate join after each thread call. Instead it relies on the side effect that cfthread runs in its own request-like scope, independent of the page.

I'll not cover ColdFusion gateways here. HTTP daisy chaining means executing an increment of the work, and at the end of the increment launching a request to the same algorithm telling it to execute the next increment.
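
As a rough sketch of the daisy-chain idea, the Java below models the state that travels from one request to the next. Everything here is hypothetical: DaisyChain, runIncrement, and the batch size are illustrative names and values, and each call to runIncrement stands in for one HTTP request. In a real CFML setup each request would launch the next one itself (for example via cfhttp, which is an assumption about the setup) rather than being driven by a loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; each call to runIncrement stands in for one HTTP request.
class DaisyChain {
    // Processes ids in [offset, offset + batchSize) and returns the offset the
    // next request should start from, or -1 when no work is left.
    static int runIncrement(List<Integer> ids, int offset, int batchSize,
                            List<Integer> processed) {
        int end = Math.min(offset + batchSize, ids.size());
        for (int i = offset; i < end; i++) {
            // Stand-in for "run the queries and write one JSON file".
            processed.add(ids.get(i));
        }
        return end < ids.size() ? end : -1;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(10, 20, 30, 40, 50);
        List<Integer> processed = new ArrayList<>();
        int next = 0;
        int requests = 0;
        // The loop only simulates the chain: in practice each iteration would be
        // a fresh request, so memory held by the previous one can be released.
        while (next != -1) {
            next = runIncrement(ids, next, 2, processed);
            requests++;
        }
        System.out.println(processed + " in " + requests + " requests");
        // prints [10, 20, 30, 40, 50] in 3 requests
    }
}
```

The only state passed along the chain is the next offset, which is what keeps each increment independent of the ones before it.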

Basically, all three approaches allow those memory references to be collected mid process.

And yes, for whoever asks, bugs have been raised with Adobe; see the question referenced. Also, I believe this issue is specific to Adobe ColdFusion, but I have not tested Railo or OpenBD.

Finally, I have to rant. I've spent a lot of time tracking this one down and fixing it in my own large code base, and several others listed in the question referenced have as well. AFAIK Adobe has not acknowledged the issue, much less committed to fixing it. And yes, it's a bug, plain and simple.

Thanks for the response, I had been thinking about option 1 & 2 while going to sleep last night, it's nice to hear from someone else that had to tackle the same issue. I will probably end up doing option 1 or 2 until I find a better environment for this task. — erikvold, Mar 05 '11 at 19:16
@Erik Vold: actually the cfthread approach along with a high or no timeout environment has worked very well for me doing a number of different long running processes. — orangepips, Mar 05 '11 at 19:32
