
I have this code, which works fine when the database is small with few records; it writes the JSON to the file properly. But when the data is huge, it just times out.

<cfloop list="t" index="k">
    <cfquery name="qry">
        select * from #k#
    </cfquery>
    <cfset js = serializeJSON(qry, 'struct')>
    <cffile action="write" file="#k#" output="#js#">
</cfloop>

I tried using threads, but they are also not working; it just creates empty table files with no values if I use cfthread with joins.

I thought of splitting the output into chunks of 1000 records per table, and then doing something like

table_1, table_2, table_3 for the same table when it has millions of records, and skipping the split for tables with fewer than 1000 records, so they get only one file.

But I am not sure which approach is best, and I need a starting point.

James A Mohler
indy
  • For a starting point, ask yourself some questions. First, what do you intend to do with these files? – Dan Bracuk Jul 04 '20 at 02:46
  • I have some intentions which I can't share, but it would be of great help if I had those files generated – indy Jul 04 '20 at 03:03
  • What version of ColdFusion are you running? What DB and what version? – James A Mohler Jul 04 '20 at 04:03
  • Another question you can ask yourself is, "What percentage of the data I'm gathering is data that I have previously gathered and hasn't changed since?" – Dan Bracuk Jul 04 '20 at 13:20
  • Usually it's 100%, but there are some tables which never change; they hold a very small percentage of the data, so even fetching them will not make much difference – indy Jul 04 '20 at 14:36
  • @JamesAMohler : I am on Lucee Latest & SQL Server 2017 – indy Jul 04 '20 at 14:37
  • @JamesAMohler There are other ways you might mention, like taking a backup using the backup command at the cmd prompt, which I do not want to do. But I am open to changing the query to return the JSON data, since SQL Server supports it, removing ColdFusion from the serialization step. That still goes into Java heap space, though – indy Jul 04 '20 at 14:38
  • I assume it's not about the request timeout, but rather about that it takes too long and hangs the thread at some point? Is your JVM's heap maxing out and you find yourself caught in a major GC loop? – Alex Jul 04 '20 at 14:52

3 Answers


First of all, let's split this up:

Resultset from database

<cfquery name="qry">
    select * from #k# 
</cfquery>
  • The database server retrieves the data and streams it over the network to the ColdFusion server
  • ColdFusion reads the data into a query object and keeps it in the heap

Serializing the resultset from database

<cfset js= serializeJSON(qry,'struct')>
  • ColdFusion recursively serializes the whole query object
  • ColdFusion creates a string object that contains the serialized data and stores it in the heap

Writing the serialized resultset from memory onto the filesystem

<cffile action="write" file="#k#" output="#js#">
  • ColdFusion writes the string object into a file on the filesystem

Doing all of this within the same request/thread

<cfloop list="t" index="k">
    ...
 </cfloop>

Conclusion

Your code tortures the JVM heap, because references have to be kept until the end of each iteration. The GC can only clean up after a full table has been processed. Large tables (1,000,000+ rows) will likely kill the thread or even hang the JVM.


The Fix: Resultset from database

Retrieving large resultsets at once will always hurt performance. While streaming lots of data within a local network (assuming the database is in the same network) just takes a bit more time, the memory required to store the full resultset is going to be an issue for the JVM.

Instead of doing everything at once, consider splitting it up in smaller chunks of data. Use OFFSET and FETCH in the SQL statement to limit the number of rows per loop. Having multiple iterations will allow the Java GC to free up memory used by previous iterations, relieving the heap.
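One detail worth calling out: on SQL Server (which the OP is running), OFFSET/FETCH is only valid together with an ORDER BY clause. A minimal sketch of one paged query; the `users` table and its `id` column are placeholder names:

```cfml
<!--- fetch one 50,000-row slice; "users" and "id" are hypothetical names --->
<cfset rowsOffset = 0>
<cfquery name="qry">
    select *
    from users
    order by id -- SQL Server requires ORDER BY when using OFFSET/FETCH
    offset #rowsOffset# rows fetch next 50000 rows only
</cfquery>
```

Each subsequent iteration bumps `rowsOffset` by the slice size until the query returns no rows.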

The Fix: Serializing the resultset from database

Same issue. Big datasets will hurt performance. Split the work by serializing row by row instead of all rows at once.

The Fix: Writing the serialized resultset from memory onto the filesystem

While this one probably doesn't need a fix, you eventually have to switch to writing line after line.


Some code

<cfset maxRowsPerIteration = 50000>

<cfloop list="t" index="k">

    <!--- create empty file to append lines later --->
    <cfset fileWrite(k, "")>

    <cfset rowsOffset = 0>

    <!--- NOTE: you might want to lock the table (prevent write access) here --->

    <!--- infinite loop will be terminated as soon the query no longer returns any rows --->
    <cfloop condition="true">
    
        <!--- fetch a slice of the full table --->
        <cfquery name="qry">
            select * from #k#
            order by 1 -- SQL Server's OFFSET/FETCH requires an ORDER BY clause
            OFFSET #rowsOffset# ROWS FETCH NEXT #maxRowsPerIteration# ROWS ONLY
        </cfquery>

        <cfif not qry.recordCount>
            <cfbreak>
        </cfif>

        <cfset rowsOffset += maxRowsPerIteration>

        <cfloop query="qry">

            <cfset rowJSON = serializeJSON(
                queryRowToStruct(qry, qry.currentRow)
            )>

            <cfset fileAppend(k, rowJSON, "UTF-8", true)>

        </cfloop>

    </cfloop>

    <!--- NOTE: if you locked the table previously, unlock it here --->

</cfloop>

For a reference implementation of queryRowToStruct, check CFLib.

Alex
  • Thanks for the heads up. A couple of questions: the above code could be reduced if I directly return JSON from the query, so why use ColdFusion to do the serialization? I am sure that will save some performance. Also, it writes the query to the same file, so if the file becomes too big it will overwhelm the system when opening it; afaik EditPad Pro can open large files, but other editors can't, so there definitely needs to be some change in the code. I'd appreciate it if you added that to your answer – indy Jul 04 '20 at 17:06
  • Yes, if you can serialize it beforehand, you should do that. That's generally true: SQL servers are very good at computing, let them handle as much as possible (within their domain). – Alex Jul 04 '20 at 17:37
  • As for the file dump: You never explained why you want to put the data into a file to begin with. Splitting the content into several files is possible, so is reading a large file by streaming bytes selectively. It all depends on your use case and how you need to access these files. – Alex Jul 04 '20 at 17:42
  • I'd rather stream them into files limited by size or by rows, so that at least a basic Notepad can open all the files without much memory consumption; that is my target. As an example: if one table has a million records, I'd rather split the file as users_1.json, users_2.json, while other tables with fewer records can be a single file. So my goal is to write the files without too much load on the JVM. Your code does almost everything except the file splitting, and I am using cflock to wrap the code to make it work properly with cfsetting timeouts and cfthreads; I don't trust threads much – indy Jul 04 '20 at 18:56
  • Threads basically sometimes skip the processing and just write empty files, as Lucee limits the number of simultaneous threads to 20. – indy Jul 04 '20 at 18:57
  • Since the OP is using Lucee, you could return the query rows as structs directly, cutting out the need for the `queryRowToStruct` udf: `` – CfSimplicity Jul 05 '20 at 15:25

This is really a comment, but it is way too long.

SQL Server 2017 can create JSON directly.

<cfloop list="t" index="k">
    <cfquery name="qry">
        SELECT (
            SELECT *
            FROM #k#
            FOR JSON AUTO
        ) AS data
    </cfquery>

    <cffile action="write" file="#k#" output="#qry.data#">
</cfloop>
James A Mohler
  • Yes, that is the approach I am going to follow. But what I am asking is: I need to find a way to split the output into separate files when a table has many rows and it becomes difficult to open the created files – indy Jul 05 '20 at 10:16
  • You can use @Alex's approach with OFFSET and FETCH NEXT. I also recommend adding an ORDER BY so that the rows really are the rows you want. – James A Mohler Jul 05 '20 at 19:00
  • What I am doing now is: splitting the file if the size increases beyond 5 MB, and creating a new file with the same name plus a counter – indy Jul 05 '20 at 19:46
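That size-based rollover could be sketched roughly as follows (untested; `queryRowToStruct` is the CFLib UDF mentioned in Alex's answer, and the 5 MB threshold is arbitrary):

```cfml
<cfset maxBytes = 5 * 1024 * 1024>  <!--- roll over after ~5 MB --->
<cfset part = 1>
<cfset currentFile = "#k#_#part#.json">
<cfset fileWrite(currentFile, "")>

<cfloop query="qry">
    <!--- once the current file exceeds the limit, start a new numbered file --->
    <cfif getFileInfo(currentFile).size gt maxBytes>
        <cfset part++>
        <cfset currentFile = "#k#_#part#.json">
        <cfset fileWrite(currentFile, "")>
    </cfif>
    <cfset fileAppend(currentFile, serializeJSON(queryRowToStruct(qry, qry.currentRow)))>
</cfloop>
```

Checking `getFileInfo().size` once per row adds filesystem overhead; tracking the appended byte count in a variable instead would avoid that.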

Others have touched upon the JVM and Garbage Collection but have not followed up on the potential quick win due to how CF handles GC.

CF can GC after each function returns and also at the end of each request. So if you do something memory-intensive a few times in a loop, or something moderately memory-hungry many times in a loop, you should abstract that 'something' into a function and call that function inside the loop. That way the memory can be released per iteration if necessary, instead of being held until the end of the request and potentially maxing out the heap space before the end-of-request Garbage Collection.

Eg refactoring your original code to this, is much more GC friendly:

<cffunction name="tableToFile" output="false">
    <cfargument name="tableName" type="variableName" required="true" />
    <cfquery name="local.qry">
        select * from #arguments.tableName#
    </cfquery>
    <cfset local.js = serializeJSON(local.qry,'struct') />
    <cffile action="write" file="#arguments.tableName#" output="#local.js#" />
</cffunction>

<cfloop list="t" index="k">
    <cfset tableToFile(tableName=k) />
</cfloop>

This approach won't solve your problem on its own, though, if any single iteration of that loop consumes too much memory because the query is too large. If that is your problem, then you should implement this in combination with an approach like Alex's to fetch your rows in batches, and, presuming your SQL Server is better equipped for the task than your Lucee server, also James's approach of letting SQL Server do the JSON serialization.
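A rough, untested sketch of that combination (batched OFFSET/FETCH, SQL-side FOR JSON serialization, and per-call function scope; the slice size and file naming are arbitrary choices):

```cfml
<cffunction name="exportSlice" output="false" returntype="string">
    <cfargument name="tableName" type="variableName" required="true" />
    <cfargument name="offset" type="numeric" required="true" />
    <cfargument name="rows" type="numeric" required="true" />
    <!--- let SQL Server serialize one slice of the table to JSON --->
    <cfquery name="local.qry">
        SELECT (
            SELECT *
            FROM #arguments.tableName#
            ORDER BY 1
            OFFSET #arguments.offset# ROWS FETCH NEXT #arguments.rows# ROWS ONLY
            FOR JSON AUTO
        ) AS data
    </cfquery>
    <cfreturn local.qry.data />
</cffunction>

<cfloop list="t" index="k">
    <cfset sliceSize = 50000>
    <cfset offset = 0>
    <cfset part = 1>
    <cfloop condition="true">
        <cfset json = exportSlice(tableName=k, offset=offset, rows=sliceSize)>
        <!--- FOR JSON produces no row for an empty slice, so stop on an empty result --->
        <cfif not len(json)>
            <cfbreak>
        </cfif>
        <!--- one numbered file per slice keeps individual files small --->
        <cfset fileWrite("#k#_#part#.json", json)>
        <cfset offset += sliceSize>
        <cfset part++>
    </cfloop>
</cfloop>
```

Null handling of the scalar subquery may differ between Lucee and Adobe ColdFusion, so treat this as a starting point rather than production code.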

Sev Roberts
  • I have never seen ``. In particular `type="variableName" ` seems strange. Wouldn't that be a string? – James A Mohler Jul 06 '20 at 16:38
  • Looks like it is, plus some extra validation. "a string formatted according to ColdFusion variable naming conventions.". Though I'd seriously hope the OP is already validating for sql injection and security purposes.... – SOS Jul 07 '20 at 01:18
  • Indeed. I didn't want to dwell on basics like SQL injection. Projects that I work on have our own standards for verifying dynamic output into a query where queryParams do not apply - eg wrapper functions or tags for making object names and column names safe when determined at run time - and I would hope that the OP has their own equivalent. But I still couldn't bring myself to output a potentially unsanitized string into the query, so felt that type="variableName" was a reasonable compromise without distracting from the original question. Turns out it distracted from the question anyway :-D – Sev Roberts Jul 07 '20 at 11:52