1

I am using the Lahman Baseball database for a project about comparing different query technologies. I already did a bunch of queries in SQL and then used the SQL workbench to port the tables to XML files. I then uploaded them to an existdb server and I am trying to run queries on it but they are way too slow. I have an xml file called appearances.xml that looks like this. However it has 98,146 ROW elements.

<appearances>
<ROW>
    <yearID>1871</yearID>
    <teamID>BS1</teamID>
    <lgID>NA</lgID>
    <masterID>barnero01</masterID>
</ROW>
<ROW>
    <yearID>1871</yearID>
    <teamID>BS1</teamID>
    <lgID>NA</lgID>
    <masterID>barrofr01</masterID>
</ROW>
</appearances>

I also have an xml file called master.xml which looks like this which has 18,354 ROW elements.

<master>
<ROW>
    <masterID>aardsda01</masterID>
    <nameFirst>David</nameFirst>
    <nameLast>Aardsma</nameLast>
</ROW>
<ROW>
    <masterID>aaronha01</masterID>
    <nameFirst>Hank</nameFirst>
    <nameLast>Aaron</nameLast>
</ROW>
</master>

I am trying to run an XQY query to join the two tables and find the first and last name of every entry that has the teamID of 'LAN'.

let $laPlayers := for $appearance in /appearances/ROW
          let $player := /master/ROW/masterID=$appearance/masterID
          where $appearance/teamID eq "LAN"
          return fn:concat($player/nameFirst/text(),' ', 
$player/nameLast/text())

return $laPlayers

It runs fine but it is soooooo slow! In the appearances table, there are multiple entries for each masterID, because of the yearID (which I don't use here) so I tried calling distinct-values first.

let $laIds = fn:distinct-values(for $appearance in /appearances/ROW
              where $appearance/teamID eq "LAN"
              return $appearance/masterID/text())
let $laPlayers = for $player in /master/ROW
              for $id in $laIds
              where $id eq $player/masterID
              order by $player/nameLast
              return fn:concat($player/nameFirst/text(),' ', 
$player/nameLast/text())

return $laPlayers

That made it faster but it is still way slower than it's SQL equivalent. Is there something I can do to make it faster? (here is the SQL equivalent)

SELECT DISTINCT m.nameFirst, m.nameLast FROM appearances a, master m 
WHERE a.teamID = "LAN" AND a.masterID = m.masterID ORDER BY m.nameLast, m.nameFirst;
Chase Lundell
  • 93
  • 1
  • 11
  • Which version of eXist? Have you configured any indexes on the elements you are looking up in this query? – Joe Wicentowski Apr 05 '18 at 15:54
  • 1
    It's the latest version, so 4.0.0. I have not done any configuring of indexes in eXist. I've never used eXist and I'm really new at xquery and xml databases. Is there a simple index configuration example you know of that would be helpful for me to utilize here? – Chase Lundell Apr 05 '18 at 16:03
  • Great, welcome to eXist and the world of XQuery and databases. You may find it useful to join the [exist-open mailing list](https://sourceforge.net/p/exist/mailman/) and/or the [eXist-db room on HipChat](https://existdb.hipchat.com/invite/300223/6ea0341b23fa1cf8390a23592b4b2c39). – Joe Wicentowski Apr 05 '18 at 16:17
  • For this modest amount of data, I'm not sure you really need a database. With Saxon-EE you could run this query directly on the source files, and the optimizer would give you an optimized join without any need to pre-define your indexes. – Michael Kay Apr 07 '18 at 08:57

1 Answers1

1

Your query would benefit from a range index of type xs:string on the <teamID> and <masterID> elements, since an index allows the database to perform a quick lookup of just the indexed values rather than scanning across all of the data for the requested elements and values. To configure such an index, create a collection configuration document like this one:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <range>
            <create qname="teamID" type="xs:string"/>
            <create qname="masterID" type="xs:string"/>
        </range>
    </index>
</collection>

Save this document, naming it collection.xconf, placing it in the location within /db/system/config corresponding to the collection that contains the data. For example, if your data is in /db/apps/my-app/data, then save the collection.xconf file in /db/system/config/db/apps/my-app/data.

Here are links to the eXist documentation about range indexes and broader topics that should help you as you learn:

Joe Wicentowski
  • 5,159
  • 16
  • 26
  • Great, happy to hear it! Also, you might find these posts useful: https://stackoverflow.com/questions/38359614/improve-performance-of-query-with-range-indexes-in-exist-db and https://stackoverflow.com/questions/49174006/massive-differences-in-speed-with-virtually-identical-xquery-exist-db-interoga. – Joe Wicentowski Apr 05 '18 at 16:07
  • Also, regarding your use of the `fn:distinct-values()` function, see this advice https://exist-db.org/exist/apps/doc/tuning.xml#D2.4.12.16. – Joe Wicentowski Apr 05 '18 at 16:10
  • Ok so I tried what you recommended by putting my xml files in db/apps/my-app/data and then putting a collection.xconf in system/config/db/apps/my-app/data with the range indexes and then I saved it and it said to reindex, so I did. I then waited until it said it finished indexing and I ran the query, but it is still running just as slow. Is there something I could have done wrong in setting up the xconf file or the app? It seems like it didn't actually work and I'm querying unindexed files still – Chase Lundell Apr 05 '18 at 17:31
  • Let's try some troubleshooting. Can you open the "monex" app at http://localhost:8080/exist/apps/monex, go to the Profiling section, click Enable Profiling, run your query, then return to Profiling, click Refresh, then click on the Index Usage pane. This will show you a list of all of the queries and whether they used indexes or not. What do you see? For subsequent tests of your queries, use the Clear button to eliminate past results. – Joe Wicentowski Apr 06 '18 at 14:32
  • Man! I'm sorry, I'm not sure what happened but I think something I changed in my attempts to get the indexing right actually worked. I ended up putting the data in a folder under db and then I put the collection config in system/config/db/my-new-folder/ and that seems to work. It wasn't last night but maybe after leaving it to index and do it's thing all night without me messing with it, it was able to finish completely. I opened the monex app and I AM getting the FULL optimization now. The queries are also so much faster now. Thanks for the insight into monex and for being willing to help. – Chase Lundell Apr 06 '18 at 16:12
  • 1
    your recommended post (https://stackoverflow.com/questions/49174006/massive-differences-in-speed-with-virtually-identical-xquery-exist-db-interoga) was the one that led me to the solution – Chase Lundell Apr 06 '18 at 16:13
  • Great, glad to hear you’re back on track. Indexes should work regardless of the depth of your collection, so my guess is that there was some slight mismatch in the collection name or location between the actual data collection and the system config collection. Your persistence paid off, so congrats! – Joe Wicentowski Apr 06 '18 at 16:37