I use elementpath to evaluate XPath queries. My XML has a flat structure in which every element carries a unique id attribute.

<items>
  <item id="1">...</item>
  <item id="2">...</item>
  <item id="3">...</item>
  ... 500k elements
  <item id="500003">...</item>
</items>

I want the parser to find the first occurrence without traversing all the nodes. For example, I want to select //items/item[@id = '3'] and stop after visiting only 3 nodes (not all 500k). That would be a nice optimization in many cases.
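To make the desired behaviour concrete, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` instead of `elementpath` (an assumption made only to keep the example dependency-free; ElementTree supports just a small XPath subset). The whole tree is still built up front, so this only shortens the search, not the parsing. The sample document and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the real 500k-element document (hypothetical data).
XML = "<items>" + "".join(
    f'<item id="{i}">value {i}</item>' for i in range(1, 1001)
) + "</items>"

# Note: the full tree is parsed into memory here.
root = ET.fromstring(XML)

# find() returns only the first matching child, or None if nothing matches.
# The path is relative to <items>, hence "item[...]" rather than "//items/item[...]".
hit = root.find("item[@id='3']")

if hit is not None:
    print(hit.get("id"), hit.text)
```

In CPython, `find()` is built on a lazy iterator, so the search itself should end at the first match, but that is an implementation detail rather than a documented guarantee.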

Nickon
  • you might want to use iterparse from e.g. etree – Wolfgang Fahl Jun 02 '22 at 09:46
  • The problem is I get XPath queries from another service, so I'm forced to use XPaths here – Nickon Jun 02 '22 at 10:18
  • is that a contradiction? see https://stackoverflow.com/questions/12332621/use-iterparse-and-subsequently-xpath-on-documents-with-inconsistent-namespace – Wolfgang Fahl Jun 02 '22 at 10:49
  • Yes, it does, but I have noticed that `elementpath` has this `iter_select()` function that could be used as well: https://elementpath.readthedocs.io/en/latest/xpath_api.html#xpath-selectors – Nickon Jun 02 '22 at 11:08
  • If you use `(//items/item[@id = '3'])[1]` then I would expect the XPath engine to stop looking after the first `item` with `id="3"` has been found. But with XPath, you usually get a tree (of e.g. 500k nodes) built before you can navigate that tree. Exception would be Saxon EE with streaming and "early exit" strategy where you can ensure that the forwards only parsing stops once you have your result. – Martin Honnen Jun 03 '22 at 06:48
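
The `iterparse` suggestion from the comments can be sketched roughly as follows. This is a minimal illustration using the standard library, not a general XPath evaluator: the predicate (`item` with a given `id`) is hard-coded, and the function name `find_item_streaming` and the sample data are hypothetical. Because `iterparse` reads its input in chunks, the parser may read somewhat past the match before the loop exits.

```python
import io
import xml.etree.ElementTree as ET


def find_item_streaming(source, wanted_id):
    """Return (element, items_seen) for the first <item> with the given id."""
    seen = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        seen += 1
        if elem.get("id") == wanted_id:
            return elem, seen  # early exit: no further events are requested
        elem.clear()  # drop text and attributes of items we do not need
    return None, seen


# Small stand-in for the real document (hypothetical data).
xml_bytes = ("<items>" + "".join(
    f'<item id="{i}">value {i}</item>' for i in range(1, 1001)
) + "</items>").encode("utf-8")

elem, seen = find_item_streaming(io.BytesIO(xml_bytes), "3")
print(elem.get("id"), elem.text, "items seen:", seen)
```

The returned counter makes the early exit visible: only the items up to and including the match are handed to the loop.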

1 Answer

Here is an example that uses XSLT 3 streaming with a static parameter for the XPath expression. It uses xsl:iterate with xsl:break to produce the "early exit" as soon as the first matching item has been found:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all">
  
  <xsl:param name="path" static="yes" as="xs:string" select="'items/item[@id = ''3'']'"/>

  <xsl:output method="xml"/>

  <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:iterate _select="{$path}">
      <xsl:if test="position() = 1">
        <xsl:copy-of select="."/>
        <xsl:break/>
      </xsl:if>
    </xsl:iterate>
  </xsl:template>

</xsl:stylesheet>

You can run it from Python with SaxonC EE (unfortunately, streaming is only supported by EE), e.g.

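import saxonc

# license=True requests the licensed (EE) feature set that streaming needs
with saxonc.PySaxonProcessor(license=True) as proc:
    print("Test SaxonC on Python")
    print(proc.version)

    xslt30proc = proc.new_xslt30_processor()

    # `path` is a static parameter, so set it before compiling the stylesheet
    xslt30proc.set_parameter('path', proc.make_string_value('/items/item[@id = "2"]'))

    transformer = xslt30proc.compile_stylesheet(stylesheet_file='iterate-items-early-exit1.xsl')

    # Apply the stylesheet to the source document (streamed, per xsl:mode)
    xdm_result = transformer.apply_templates_returning_value(source_file='items-sample1.xml')

    # Report any error raised during the transformation
    if transformer.exception_occurred:
        print(transformer.error_message)

    print(xdm_result)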
Martin Honnen