
Coming from this question, I have managed one entirely unsatisfactory solution for accessing an eXist-db collection() from an XSLT 2.0 stylesheet loaded from within an eXist-db XQuery transform() call.
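
For context, the stylesheet is applied from XQuery using eXist's transform module, along these lines (a minimal sketch; the stylesheet name dep_events.xsl and the choice of input document are illustrative, not my exact code):

xquery version "3.1";

import module namespace transform = "http://exist-db.org/xquery/transform";

(: apply the stylesheet to a single input document; the stylesheet itself
   then pulls in the whole collection via the catalog file shown below :)
transform:transform(
    doc('/db/apps/deheresi/data/ms609_0001.xml'),
    doc('/db/apps/deheresi/resources/dep_events.xsl'),
    ())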

The XSLT file declares a variable:

 <xsl:variable name="coll" select="collection('xmldb:exist:///db/apps/deheresi/data/collection_ms609.xml')"/>

This points to a catalog XML file I created (per the Saxon documentation) in order to load the actual collection; it looks like this:

<collection stable="true">
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0001.xml"/>
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0002.xml"/>
  ...
  ...
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0709.xml"/>
  <doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0710.xml"/>
</collection>

This allows the XSLT file to use a key that needs to search across all these files:

<xsl:key name="correspkey" match="tei:seg[@type='dep_event' and @corresp]" use="@corresp"/>

<xsl:variable name="correspvar" select="self::tei:seg[@type='dep_event' and @corresp]/@corresp"/>

<xsl:value-of select="$coll/(key('correspkey',$correspvar) except $correspvar)/@id" separator=", "/>

As it stands, with 50 documents in the catalog I get a result in 2 minutes; with all 710 I get a Java GC error after 4 minutes.

I have set indexes on the relevant nodes in eXist-db, but this has no effect on performance. It seems to me that Saxon works 'outside' eXist-db's optimisations, treating eXist-db as a simple file system.

(For what it's worth, setting href="/db/apps/deheresi/data/ms609_0001.xml" does not let Saxon see the documents.)

I suspect all of this is why eXist-db documentation on this approach is non-existent.

In short, I am looking for ways to run intensive searches over collections from within an XSLT 2.0 stylesheet loaded in eXist-db via XQuery transform().

If anything, I hope this post helps future searchers encountering the same problem.

jbrehr
  • If a single stylesheet has to pull in 710 documents and index them, this will certainly take memory and time. Java memory limits can be adjusted; see the relevant options of the `java` command on your system. What happens if you run Saxon on its own from the command line against a collection of those 710 documents pulled from the file system? Do you get the same or similar performance and memory problems? It will of course not solve the lack of integration between Saxon and eXist-db, but I am not sure there is any easy fix for that. – Martin Honnen Oct 23 '18 at 10:16
  • See the options `-Xms<size>` and `-Xmx<size>` in https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html for instance to adjust `java`'s memory limits (a sample invocation follows these comments). – Martin Honnen Oct 23 '18 at 10:19
  • Indeed, I've also run this off the local directory (although that doesn't need a catalogue) with the same effect. I had adjusted the memory, but anything that needs more than a few seconds isn't really a solution. – jbrehr Oct 23 '18 at 10:39
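
For reference, a standalone Saxon run with an enlarged Java heap looks roughly like this (a sketch; the jar name, heap size and file names are illustrative):

java -Xmx4g -cp Saxon-HE-9.8.jar net.sf.saxon.Transform -s:ms609_0001.xml -xsl:dep_events.xsl -o:out.xml

Running against a local copy of the data takes eXist-db out of the picture, which at least shows whether the bottleneck is Saxon building its in-memory indexes over 710 documents or the xmldb:exist document retrieval.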

1 Answer


The general architectural principle is: try to move the searching closer to the data. In this case this means: use eXist to find the documents of interest, don't extract every possible candidate document from eXist and then ask Saxon to do the searching. Select the actual documents of interest in an eXist XQuery, and then pass the list of these documents to Saxon in a stylesheet parameter.

Michael Kay
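
A rough sketch of the approach described in this answer, using eXist's transform module (the stylesheet path, parameter name and prefix binding are illustrative assumptions; transform:transform passes parameters as strings, so the selected document URIs are joined into one string and resolved with doc() inside the stylesheet):

xquery version "3.1";

import module namespace transform = "http://exist-db.org/xquery/transform";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: let eXist's own (index-backed) query find only the documents
   that actually contain matching segments :)
let $docs := collection('/db/apps/deheresi/data')
             //tei:seg[@type = 'dep_event'][@corresp]/root()
(: document-uri() is assumed to return the database path; the xmldb
   scheme is prefixed so that Saxon's doc() can resolve it :)
let $uris := string-join(
                 for $d in $docs return 'xmldb:exist://' || document-uri($d),
                 ' ')
return
    transform:transform(
        doc('/db/apps/deheresi/data/ms609_0001.xml'),
        doc('/db/apps/deheresi/resources/dep_events.xsl'),
        <parameters>
            <param name="docuris" value="{$uris}"/>
        </parameters>)

On the stylesheet side, the collection variable then loads only the documents eXist has already identified:

<xsl:param name="docuris" select="''"/>
<xsl:variable name="coll" select="for $u in tokenize($docuris, '\s+')[.] return doc($u)"/>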