Memory efficient XSLT for transforming large XML files

Question

This question is related to a recent answer by michael.hor257k, which is in turn related to an answer by Dimitre Novatchev.

When I used the stylesheet from the above-mentioned answer (by michael.hor257k) on a large XML file (around 60MB; a sample is shown below), the transformation was carried out successfully.

I then tried another stylesheet, slightly different from michael.hor257k's, which is intended to group each element that has a child sectPr together with its following siblings (up to the next following sibling that has a child sectPr), recursively (i.e., grouping elements down to the full depth of the input XML).

The sample input XML:

<body>
    <p/>
    <p>
        <sectPr/>
    </p>
    <p/>
    <p/>
    <tbl/>
    <p>
        <sectPr/>
    </p>
    <p/>
</body>

The stylesheet I tried:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*[1] | *[sectPr]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
    </xsl:template>

    <xsl:template match="*[sectPr]">
        <myTag>
            <xsl:copy>
                <xsl:apply-templates select="*[1] | *[sectPr]"/>
            </xsl:copy>
            <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
        </myTag>
    </xsl:template>

</xsl:stylesheet>

To my surprise, I encountered an OutOfMemoryError when transforming an XML file of around 60MB.

I think I do not understand the trick behind the stylesheets provided by michael.hor257k and Dimitre Novatchev that keeps them from causing memory errors.

What is the big difference between my stylesheet and the above-mentioned answers that causes the OutOfMemoryError? And how can I update the stylesheet to be memory efficient?

Q: What platform are you on? Windows? What XML parser and/or what language are you using? System.XML with C# and MSVS 2012? How much RAM does your system have? What does task mgr show when you get the "out of memory" error? — FoggyDay, May 02 '15 at 05:50
@FoggyDay It's Windows, using saxon-6.5.5.jar and saxon9.jar. It is a machine with 3GB of RAM. But the question is more specific to the way the stylesheet is written, as the referred answer from michael.hor257k does the transformation seamlessly. — Lingamurthy CS, May 02 '15 at 06:19
The point is "60MB" is a relatively small size to be getting "out of memory" errors, even if you "optimize" your stylesheet. Q: What happens when you try increasing JRE memory, e.g. `java -Xmx2048m -jar MyProg.jar myXML.xml`? Q: Have you tried looking at different usage patterns in [JVisualVM](http://docs.oracle.com/javase/7/docs/technotes/guides/visualvm/)? Q: Do you have control over your Java app, such that you can try optimizations such as [clearDocumentPool()](http://stackoverflow.com/questions/19764275/java-lang-outofmemoryerror-while-transforming-xml-in-a-huge-directory)? — FoggyDay, May 02 '15 at 07:07
@FoggyDay I am well aware that 60MB isn't a big file and change in memory parameters will get the file processed. The question is specifically to improve the XSLT, and to learn the difference. If you've read DimitreNovatchev's answer that I've referred to, the first XSLT is less memory efficient than the second one, and that is what interests me. — Lingamurthy CS, May 02 '15 at 07:15
Lingamurthy CS, your transformation above and the transformation by michael.hor257k that you point to produce different results when run on the provided source XML document. Which is the wanted result? It would be more useful for producing a solution if you stated your problem: what result should be produced, which nodes of the XML document define which parts of the result, and what constraints/invariants are imposed. Also, `60MB` isn't too meaningful by itself. It would be good if you also mentioned how many lines the 60MB document consists of. — Dimitre Novatchev, May 02 '15 at 21:10
@DimitreNovatchev Yes, the results produced were different because there was a nested `p` with a child `sectPr`. After removing this element (the nested `p`) from the input, the output from both stylesheets is the same. I've edited my sample XML. The input I used had 4.5 million lines. The question is: given that the stylesheets are supposed to produce the same results, why is the stylesheet I used less memory efficient, as it appears to be? Also, your answer that I've referred to has 2 stylesheets, of which the latter is more memory efficient. I want to understand the difference between the two. — Lingamurthy CS, May 02 '15 at 23:30
Lingamurthy CS, please add the `<xsl:strip-space elements="*"/>` declaration, which you removed from the original solution. This strips from the source XML document any whitespace-only text node. Not stripping these nodes may significantly increase the number of nodes and the memory to hold them -- in your case, the required memory to hold the XML document will be almost twice as much compared to the necessary memory to hold the XML document with these nodes stripped. I ran your transformation OK, but with the nodes stripped it runs 20% faster. How many GB of RAM does your computer have? — Dimitre Novatchev, May 04 '15 at 01:55
Lingamurthy CS, I ran your transformation -- one time as published in the question, and a second time with `<xsl:strip-space elements="*"/>` added -- with Saxon 9.1J, because it also shows the memory consumption of the transformation. Both runs were successful. In the first case the number of nodes processed was 9525004 and 340MB RAM was used. The transformation took 5.3 sec. In the second case the number of nodes was 4336366 and 215MB RAM was used. The transformation ran in 5.06 sec. — Dimitre Novatchev, May 04 '15 at 03:28
@DimitreNovatchev Thank you! I realize how big a difference `xsl:strip-space` makes. If you could make your comment an answer, I'll be happy to accept it. — Lingamurthy CS, May 04 '15 at 14:12

Dimitre Novatchev · Accepted Answer · 2015-05-07T01:10:33.737

Lingamurthy CS,

Please, add the <xsl:strip-space elements="*"/> declaration, which you removed from the original solution. This strips from the source XML document any whitespace-only text node.

Not stripping these nodes may significantly increase the number of nodes and the memory to hold them -- in your case, the required memory to hold the XML document will be almost twice as much compared to the necessary memory to hold the XML document with these nodes stripped.

I ran your transformation OK, but with the nodes stripped it runs 20% faster -- on MS XslCompiledTransform.

Then I ran your transformation -- one time as published in the question, and a second time with <xsl:strip-space elements="*"/> added -- with Saxon 9.1J, because it also shows the memory consumption of the transformation. Both runs were successful. In the first case the number of nodes processed was 9525004 and 340MB RAM was used. The transformation took 5.3 sec. In the second case the number of nodes was 4336366 and 215MB RAM was used. The transformation ran in 5.06 sec.
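
For reference, here is a sketch of the stylesheet from the question with the declaration added. Only the xsl:strip-space line and the comments are new; both templates are copied unchanged from the question. Since the templates select only element nodes and never copy text, the output should be unaffected, and the saving comes entirely from the smaller source tree.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- New: drop whitespace-only text nodes from the source tree -->
    <xsl:strip-space elements="*"/>

    <!-- Elements without a sectPr child: copy the element, descend into
         its first child and every child that has a sectPr, then move on
         to the next sibling unless that sibling starts a new group -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*[1] | *[sectPr]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
    </xsl:template>

    <!-- Elements with a sectPr child: open a group, then pull in the
         following siblings up to the next element that has a sectPr -->
    <xsl:template match="*[sectPr]">
        <myTag>
            <xsl:copy>
                <xsl:apply-templates select="*[1] | *[sectPr]"/>
            </xsl:copy>
            <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
        </myTag>
    </xsl:template>

</xsl:stylesheet>
```

On the indented sample input from the question, the unstripped tree holds 10 element nodes plus 12 whitespace-only text nodes (8 inside body and 2 inside each p that contains a sectPr), which is in line with the roughly 2x node count measured above.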

@MichaelKay, Yes, but I don't remember which ... :) It was 9M+ — Dimitre Novatchev, May 06 '15 at 22:41

Dan Field · Answer 2 · 2015-05-02T14:43:02.750

In my experience, it is very easy to write memory-inefficient XSLT. It works really well for smaller transforms (even smaller transforms of lots of files), but when you start doing complex grouping or axis traversal it becomes inefficient for large (15MB+) XML files. Would it be possible to split your large files into small ones? I've used that technique to resolve issues like this before.

Since you're using Windows, you have a few other options as well (especially since you're only using XSLT 1.0). One that might work is to try using the .NET XslCompiledTransform class, which compiles your XSLT to IL. This might not fix the memory issues, but it might perform better on your platform.

The other option would be to make use of the .NET XmlReader and XmlWriter classes, which, given your requirements, probably wouldn't be very difficult to implement. These are forward-only XML reading and writing classes. Making use of streaming allows for much greater memory efficiency.

Thanks for your answer, it is a good read. The only aim here was to find out the main difference between my XSLT and the answers I'd referred to that was making my XSLT less memory efficient, which was pointed out correctly by DimitreNovatchev. I'll keep your suggestions in mind for the future. — Lingamurthy CS, May 05 '15 at 09:58
