I'm copying the text content of one element (an embedded XML document) and creating a new document from that text, as shown below. The file format is delivered to me and I don't control it. The issue is that occasionally I get large (3MB+) text values (XML files) delivered in this one element and the parser crashes (Java heap space). I think it's because value-of can't handle the whole text as a single string. Ideally I'd like to do a copy-of or some sort of identity transform that just strips the other elements, or copies the text without buffering it into a string. Am I right in thinking this is the issue, and is there a way to fix it without adding more memory?

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" omit-xml-declaration="yes" />
      
  <xsl:template match="/">
    <xsl:value-of select="root/toplevel/row/payload" />
  </xsl:template>
</xsl:stylesheet>

Input:

<?xml version='1.0' ?>
<root>
    <toplevel>
        <row>
            <payload> 
                    &lt;?xml version="1.0" encoding="UTF-8"?>
                            &lt;documentProperties type="documentProperties">
                                &lt;producedBy>
                                    &lt;ourName type="string">NAMEHERE&lt;/ourName>
                                    &lt;user>Someone&lt;/user>
                                &lt;/producedBy>
                            &lt;/documentProperties>
            </payload>
            <System>NotWanted</System>
        </row>
    </toplevel>NotWantedEither
</root>

Note that the text in the sibling and parent elements at the end is not wanted, and it does sometimes get included by several of the copying approaches I've tried. I only want what's in payload. This code works with this example but not when the text exceeds some size limit.

Output:

                    <?xml version="1.0" encoding="UTF-8"?>
                            <documentProperties type="documentProperties">
                                <producedBy>
                                    <ourName type="string">NAMEHERE</ourName>
       ......
 <.... in practice +3 MB more content in output and source element text here...>
.......
                                    <user>Someone</user>
                                </producedBy>
                            </documentProperties>
devranred
  • 3MB is microscopic in this context. If you're running out of heap space either you have set the -Xmx option too small, or you got into a recursive loop. Show your Java code and the stack trace (format as code). – Jim Garrison Dec 21 '20 at 22:53
  • The XML parser crashes? Why does changing the XSLT help then? An XSLT processor usually works with an underlying XML parser to parse the XML input into an XDM tree representation but only then executes the XSLT code against the tree. Thus if the parser crashes it is before `xsl:value-of` or any replacement would be executed. – Martin Honnen Dec 21 '20 at 22:54
  • Thanks Jim, the Java is part of a 3rd-party app. I can't see it but I can adjust Xmx. It's at 1G now and I've increased it to 3G in testing with no luck so far. – devranred Dec 22 '20 at 00:01
  • Martin, good point, I think I meant the XSLT processor. But I also asked in the question whether I had understood the issue correctly. It copies fine (if unwanted parts are also copied) but fails when value-of is used (unwanted parts not in output), so something seems wrong. – devranred Dec 22 '20 at 00:15
  • Are you limited to only using XSL v 1.0? If your processor supports XSL v3.0, there's the `parse-xml()` function. [Here's the documentation for Saxon's implementation](https://www.saxonica.com/html/documentation/functions/fn/parse-xml.html), for instance. – Eiríkr Útlendi Dec 22 '20 at 05:31
  • I've been dealing with exactly the same problem, the difference is that the text node is 53Mb rather than 3Mb (plus, it contains astral characters, plus, it's being processed using fn:replace() before outputting). What XSLT processor are you using? Recent releases of Saxon have an optimization where xsl:value-of can stream the text directly to the serializer, but it depends very much on the detail of the processing you are doing. – Michael Kay Dec 22 '20 at 08:34
  • Thanks Eirík - yes unfortunately only v1.0. – devranred Dec 22 '20 at 13:58
  • Thanks for the comment Michael, I wanted to test this afternoon and check the processor etc. (it's Saxon EE, I believe). I'm constrained by the fact that I just build XSLT templates in this 3rd-party app that gets XML files from other systems. I'll continue to look for better options while I work and come back to this if I find anything. Nice to know I'm not the only one with the issue though. Not sure 10G will help you with your file size though! Good luck – devranred Dec 22 '20 at 16:14
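
For readers who are not limited to XSLT 1.0, the `parse-xml()` suggestion from the comments could look roughly like the sketch below. It assumes an XSLT 3.0 processor and exactly one `payload` element, as in the sample input. The `replace()` call is an addition for this sketch: the sample payload starts with whitespace, and an XML declaration is only legal at the very start of a document. This turns the escaped text into a real node tree, but it is not a memory fix, since the embedded document still has to be held in memory to be parsed.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="xml" indent="no" />

  <xsl:template match="/">
    <!-- Strip leading whitespace so the embedded XML declaration comes first,
         then parse the payload string into a document node and copy it out -->
    <xsl:copy-of select="parse-xml(replace(root/toplevel/row/payload, '^\s+', ''))" />
  </xsl:template>
</xsl:stylesheet>
```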

1 Answer

I didn't find the XSLT solution I wanted and needed a working process fairly quickly. Adding more memory solved this for me: I increased the heap space (Xmx) to 10G as a workaround for the odd time this happens.
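
For reference, the copy-of variant the question asks about can be written in XSLT 1.0 as in the sketch below. It selects only the text node children of `payload`, so the unwanted sibling and parent text is never selected, and for the sample input it should produce the same output as the value-of version. Whether it avoids the heap problem is an open assumption that was not tested here. As noted in the comments, the processor builds the whole input tree before any template runs, so the large text node is in memory either way.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />

  <xsl:template match="/">
    <!-- Copy only payload's own text nodes; System and the stray
         "NotWantedEither" text are outside this selection -->
    <xsl:copy-of select="root/toplevel/row/payload/text()" />
  </xsl:template>
</xsl:stylesheet>
```

One difference from the original: value-of in XSLT 1.0 uses only the first `payload` it finds, while this copies the text of every `row/payload`, so files with several rows would have their payloads concatenated.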

devranred