
I'm investigating use cases for streaming in XSLT. I know of two clear ones:

A. You need to transform a very large document, the entirety of which cannot be held in memory.

B. You only need a small part of the document, and often that "small part" is near the top. You can then save time via early exit.

I'm writing to ask if, in practice, there is a third real use case:

C. You have a simple transformation and want to forgo the CPU time required to build the XML tree. To give an example, imagine a store's shipments are stored in an XML structure with the following format:

Top-level = Year

2nd level = Month

3rd level = Day of shipment

4th level = Shipment ID

5th level = Individual items in shipment
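Concretely, such a document might look like this (a hypothetical sketch; all element and attribute names are invented for illustration):

```xml
<!-- hypothetical structure: year > month > day > shipment > item -->
<year value="2013">
  <month name="Jan" total-shipments="412">
    <day date="2013-01-02">
      <shipment id="S-1001">
        <item sku="A-17" qty="3"/>
        <item sku="B-02" qty="1"/>
      </shipment>
    </day>
  </month>
</year>
```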

Just for the sake of example, consider a transformation whose purpose is to pull information at the "month" level, needing only data stored in attributes of the month elements and no information about the descendants of those nodes.
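Such a transformation can be expressed streamably. Below is a minimal sketch in final XSLT 3.0 syntax; note that Saxon 9.5 predates the final Recommendation and used draft streaming syntax, so the exact spelling differs there. The element and attribute names (`month`, `@name`, `@total-shipments`) are assumptions for illustration, not taken from a real schema:

```xml
<xsl:stylesheet version="3.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Declare the default mode streamable: its templates must make a
       single downward pass and never revisit earlier parts of the tree. -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>
  <xsl:output method="text"/>

  <!-- Attributes of the node currently being streamed are always
       available, so this template is "motionless" and streamable.
       Descendants of month are skipped (the parser still consumes
       them, but no tree is built). -->
  <xsl:template match="month">
    <xsl:value-of select="@name, @total-shipments" separator=" "/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```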

Is it possible that such a transformation could benefit from streaming, even though the entire document must be read? I was hoping that some time might be gained because there is no need to build trees, but in my limited testing it appears this is not the case.

I tried such an example in Saxon 9.5.1.3, and streaming was about 20% slower than the non-streaming equivalent. Perhaps the overhead involved in executing streaming will almost always outweigh the time gained by not building trees? (At least in Saxon, where tree building is very fast.)

Or am I making an error in my testing, and there are clear examples where streaming is more efficient, even when the entire document has to be read?

David R

2 Answers


Thanks for the data on Saxon. I'm not surprised by the 20% overhead; I wouldn't have been surprised if it was 60%. Much of this has to do with maturity of the implementation; it's hard enough to get streaming working at all, before you start thinking about making it fast. But I would be surprised if it ever becomes significantly faster than conventional processing in the case of documents that are small enough to handle in memory. That's partly because the performance of the kind of transformations you can do using streaming is likely to be dominated by parsing and serialization cost, which is the same in either model.

I'm aware of a number of areas where there's scope for optimization (or at least for detailed measurement to discover whether there's scope for optimization), but the priority is on getting it all working and getting a sufficient body of test cases into place that optimization can be attempted without risking introducing more bugs.

Michael Kay
  • I'll probably continue occasionally trying this, and I'll let you know if I end up finding a genuine case where one of my real-life analyses benefited from forgoing tree building. My work actually often has very little serialization cost because I use XSL for analyzing data rather than transforming it. [I'd rather work in a language with native XPath3 than convert everything to PyTables...] – David R Feb 09 '14 at 18:42
  • One other case for reducing memory requirements, of course, is when you have a large number of small documents rather than a single large document. That could be a batch process using collection(), or a high-throughput web service doing lots of transformations. – Michael Kay Feb 09 '14 at 23:32

Besides large documents, the other possible advantage of streaming -- depending on the exact characteristics of the stylesheet and input document and how you're using the output -- may be reduced latency. That is, it may be possible to start delivering the start of the document to the next stage of processing (or to the user) sooner than in the more traditional processing model. If you're generating HTML, for example, the browser might be able to start getting the page onto the screen a bit faster.

That could be an advantage in some cases even if throughput (time to finish processing the document) is somewhat reduced.

I'm not sure about Saxon's internals, but Xalan has long offered an "incremental parsing" mode intended to permit the same kind of tradeoff: it could reduce latency in some cases, but it added overhead for tracking how much of the input had been parsed so far, so throughput might be reduced.

Pick the mode that makes sense for your application. Tools for tasks...

(I'd still like to see someone pick up on the streaming-optimization-by-projection concept that IBM patented. It's the most general approach I've yet seen to recognizing streaming optimization opportunities in unrestricted XSLT. Alas, higher-priority work drew off the resources needed to bring it from prototype to production-quality, and I haven't found personal time to attempt a skunkworks version.)

keshlam
  • Thanks for the note. I hadn't thought of that, but my professional use of XSLT currently doesn't care about latency, only about the total time required to complete the transformation. – David R Feb 09 '14 at 18:35
  • I wonder why browsers weren't driving the push for streaming XSLT for client-side rendering.... Oh I forgot, they were busy legitimizing the "real world" ugly spaghetti HTML that was written 20 years ago. – Milind R Feb 16 '15 at 07:15