In our organization we have a business application that has been using XSLT for over 10 years to transform data between systems. Over time these file transformations:

XML -> XSL -> XML

have become very time-consuming.

The input XML used to be 100 MB - 200 MB, but now we have 2, 3, 4 GB of XML, mainly during system synchronization, so we want to replace the XSLT (version 1.0) with some more advanced technology. In the future, with the biggest data structures, this number can rise even further.

For that reason I have researched different approaches, but I wonder which is best:

  1. Rewrite the XSLT transformations from version 1.0 to 2.0 (3.0?) and use the fastest processor in order to reduce time and memory consumption. (We have over 30 transformations with about 1000 lines of rules/templates each.) Implement the best practices for XSLT transformations.

  2. Use XQuery for the transformation. It is said that XQuery is good for searching data in big XML files, but we need to transform the whole XML, a big XML-to-XML transformation, so I am wondering whether it is a good fit here.

  3. Use VTD-XML, "the world's fastest XML parser". It has Java support for XML over 2 GB:

        VTDGenHuge vgh = new VTDGenHuge();

     http://vtd-xml.sourceforge.net/codeSample/cs12.html

     com.ximpleware: standard VTD-XML, supporting document sizes up to 2 GB
     com.ximpleware.extended: extended VTD-XML, supporting document sizes up to 256 GB

  4. Etc.
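For option 1, one best practice is processor-independent and can be sketched with plain JAXP: compile each stylesheet once into a `Templates` object and reuse it for every document, since stylesheet compilation is the expensive step. The stylesheet and names below are toy examples for illustration, not one of our real transformations.

```java
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import java.io.*;

public class XsltRunner {
    // Toy stylesheet for illustration only; real stylesheets would be
    // loaded from files. version='1.0' works with the JDK's built-in Xalan.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><out><xsl:value-of select='/in'/></out></xsl:template>"
      + "</xsl:stylesheet>";

    // Compile once: Templates is thread-safe and reusable, and compilation
    // is the expensive step, so do it a single time per stylesheet.
    private static final Templates COMPILED = compile(XSLT);

    private static Templates compile(String xslt) {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (TransformerConfigurationException e) {
            throw new RuntimeException(e);
        }
    }

    // Per-document transformation: Transformer instances are cheap to create
    // from a compiled Templates, but are not thread-safe, so make a new one.
    public static String apply(String xml) {
        try {
            StringWriter out = new StringWriter();
            COMPILED.newTransformer().transform(
                    new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply("<in>hello</in>"));
    }
}
```

This does not change the memory profile of a single transformation, but it avoids re-parsing 30 stylesheets on every run, whichever processor we end up with.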
Xelian
  • Have you considered using an event-driven/SAX-style parser and a single-pass transformation? This way the size of the file doesn't matter, only how much you need to work with at any given moment. – Peter Lawrey Aug 05 '16 at 10:06
  • We are open to anything that is not slow and does not consume a huge amount of memory. We sometimes have complex logic: we need to know a property of a parent element to do something, and this parent has 10000 children, and so on. With the SAX approach I think we would have to create a lot of complex listeners. I do not know if that is good or bad. – Xelian Aug 05 '16 at 10:08
  • The report on VTD-XML was pretty old (2006) and the largest file they processed was 15 MB. – Peter Lawrey Aug 05 '16 at 10:10
  • http://vtd-xml.sourceforge.net/codeSample/cs12.html I saw this. – Xelian Aug 05 '16 at 10:11
  • A listener can easily remember the previous state as required, so the number of children doesn't matter unless those children need to be able to query each other. – Peter Lawrey Aug 05 '16 at 10:11
  • In that case, it's worth considering. Using memory mapped files is a good way to utilise much more memory efficiently. – Peter Lawrey Aug 05 '16 at 10:13
  • In most cases children need some information from their parent or grandparent. It can have base64-encoded files with a lot of stuff in them. But @Peter Lawrey, I did not see VTD-XML being used for transformations, and its license is strange; I cannot tell whether it is paid or not. – Xelian Aug 05 '16 at 10:13
  • I think the first thing to establish is whether the transformation is in principle streamable: that is, the order of things in the result corresponds to the order of things in the input, and the amount of information you need to remember in order to transform the Nth thing in the input to the Nth thing in the output is small and bounded. If that's the case then you've got a choice between writing high-level code in XSLT 3.0 or low-level code in Java/SAX. If the transformation isn't streamable, then the best approach might be an XML database and XQuery. – Michael Kay Aug 05 '16 at 13:16
  • @Michael Kay the transformations are different and the logic varies. For example, in one case we have object a1 and want to transform it to b1, but in another we want to transform a2 to b2 only if a2 has a child with a specific class attribute, and to pass this value to b2 as an attribute. I do not know whether this makes the XML streamable. – Xelian Aug 05 '16 at 13:29
  • @Xelian, you can't write a `match="a2[child::foo[@class = 'bar']]"` in a streamable mode in XSLT 3.0 but it is possible to match on `a2` in a streamable mode and then process a copy of `a2` in a non-streamable mode where you have access to child nodes. – Martin Honnen Aug 05 '16 at 16:48
  • @Xelian here is a whitepaper comparing the APIs with the latest data... http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf – vtd-xml-author Aug 06 '16 at 08:00
  • So @vtd-xml-author, for big files we need a lot of RAM, unlike SAX? Is there any statistic of the consumption, for example how much RAM is needed for a 1 GB or a 2 GB file? And I am wondering why so much RAM is needed. Does VTD use big integers just to store addresses? – Xelian Aug 15 '16 at 08:08
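The event-driven, single-pass approach suggested in the comments can be sketched with the standard Java StAX API. The element names (`a2`, `b2`, `foo`, the `class` attribute) are taken from the example in the comments and are assumptions; text nodes and attribute copying are omitted for brevity.

```java
import javax.xml.stream.*;
import java.io.*;

public class StreamTransform {
    // Single forward pass over the input: every <a2> element becomes a <b2>,
    // and if the <a2> contains a <foo class="..."> child, the class value is
    // carried over as an attribute on <b2>. The only state kept between
    // events is a couple of local variables, so memory use does not grow
    // with document size or with the number of children.
    public static String transform(Reader in) {
        try {
            XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(in);
            StringWriter sw = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(sw);

            boolean insideA2 = false;
            String pendingClass = null;   // remembered until </a2> is seen

            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if (name.equals("a2")) {
                        insideA2 = true;
                        pendingClass = null;
                        // <b2> cannot be written yet: its attribute may come
                        // from a child, so it is deferred until </a2>.
                    } else if (insideA2 && name.equals("foo")) {
                        pendingClass = r.getAttributeValue(null, "class");
                    } else if (!insideA2) {
                        w.writeStartElement(name);  // copy other elements
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (r.getLocalName().equals("a2")) {
                        w.writeStartElement("b2");
                        if (pendingClass != null) {
                            w.writeAttribute("cls", pendingClass);
                        }
                        w.writeEndElement();
                        insideA2 = false;
                    } else if (!insideA2) {
                        w.writeEndElement();
                    }
                }
            }
            w.flush();
            return sw.toString();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(new StringReader(
                "<root><a2><foo class='bar'/><x/><x/></a2><a2/></root>")));
    }
}
```

The trade-off is visible even in this sketch: the parent/child dependency has to be encoded by hand as explicit state, which is exactly the listener complexity the comments above worry about.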

1 Answer


XSLT 3.0 is a work in progress, but one of its new features is streaming (https://www.w3.org/TR/xslt-30/#streaming), which lets you write stylesheets with limited memory consumption: contrary to XSLT 1.0 and 2.0, the processor does not build a full tree of the input, but reads through the input once, processing each node and keeping only that node and its ancestors in memory. Saxon 9 EE implements this: http://saxonica.com/html/documentation/sourcedocs/streaming/. The main aim is to let you process very large input documents that would not fit into memory with XSLT 2.0. The drawback is that you can only use a restricted subset of XSLT and XPath, so an existing stylesheet might not work as-is and might need to be rewritten to use only the features allowed in streamed processing.
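For illustration, a minimal sketch of what a streamable stylesheet could look like, using the `a2`/`b2` example from the comments (the names are assumptions): `xsl:mode streamable="yes"` declares streamed processing, and `copy-of(.)` materializes an in-memory copy of just the matched element so its children can be inspected without breaking the streamability rules, as described in the comments.

```xml
<xsl:stylesheet version="3.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Streamed mode: the processor makes a single forward pass and never
       builds the whole input tree (requires a streaming processor such as
       Saxon-EE). Unmatched nodes are shallow-copied to the output. -->
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>

  <xsl:template match="a2">
    <!-- copy-of(.) builds an in-memory copy of only this element, so child
         axes are usable here even though the input itself is streamed. -->
    <xsl:variable name="this" select="copy-of(.)"/>
    <b2 cls="{$this/foo/@class}"/>
  </xsl:template>
</xsl:stylesheet>
```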

Martin Honnen