2

Reading the docs http://exist-db.org/exist/apps/doc/indexing.xml I'm finding difficult to understand how and if I can improve the performances of a 'read' query (with 2 parameters: a string and an integer). Do eXist-db have a default structural index? Can I improve a 2 params query with a 'range index'?

More details about my XML db (note there are 2 different dbs simply merged on the same root):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<db>
    <docs>
        <doc>
            <header>
                <year>2001</year>
                <number>1</number>
                <type>O</type>
            </header>
            <metas>
                <meta>
                    <number>26001</number>
                    <details>
                        <detail>
                            <description>legge</description>
                            <number>19</number>
                            <date>14/01/1994</date>

                        </detail>
                        <detail>
                            <description>decreto legge</description>
                            <number>453</number>
                            <date>15/11/1993</date>
                        </detail>
                    </details>
                </meta>
            </metas>
        </doc>
        <doc>
            <header>
                <year>2001</year>
                <number>2</number>
                <type>O</type>
            </header>
            <metas>
                <meta>
                    <number>26002</number>
                    <details>
                        <detail>
                            <description>decreto legislativo</description>
                            <number>29</number>
                            <date>03/02/1993</date>
                        </detail>
                    </details>
                </meta>
                <meta>
                    <number>26016</number>
                    <details>
                        <detail>
                            <description>decreto legislativo</description>
                            <number>29</number>
                            <date>03/02/1993</date>

                        </detail>
                    </details>
                </meta>
            </metas>
        </doc>
    </docs>



    <full_text_docs>
        <doc>
            <header>
                <year>2001</year>
                <number>1</number>
                <type>O</type>
                <president>ferrari</president>
            </header>
            <text>lorem ipsum ...

            </text>
        </doc>
        <doc>
            <header>
                <year>2001</year>
                <number>2</number>
                <type>O</type>
                <president>ferrari</president>
            </header>
            <text>lorem ipsum......
            </text>
        </doc>
    </full_text_docs>
</db>

This is my xquery

xquery version "3.0";

let $doc := doc("/db//index_test/test_general.xml")//db/docs/doc
let $fulltxt := doc("/db//index_test/test_general.xml")//db/full_text_docs/doc

return <root> {
  for $a in $doc[metas/meta/details/detail[date="03/02/1993" and number = "29"]]/header
     return $fulltxt[header/year/text()=$a/year/text() and
            header/number/text()=$a/number/text() and
            header/type/text()=$a/type/text()
            ]

} </root>

Basically I simply find for the detail/number and detail/date that matches the input in the first db and take the results for querying the second db. The results are all the <full_text_header> documents that matches.

I would to know if I can create indexes for the fields number and date to improve performance. Note this is the ONLY query I need to optimize (the only I do on this db) obviously number and date changes :).

SOLUTION: For a clear explanation read the joewiz answer. My problem was the correct recognition of the .xconf file. It have to be placed in /db/yourcollectiondir. If you're using eXide when you create the file you should select Xml type with template "eXist-db collection configuration". When you try to save the file you will see a prompt "Apply configuration?" then click 'ok'. Just then run this xquery xmldb:reindex('/db/yourcollectiondir'). Now if all it's right when you run an xquery involving an index you will see the usage in "Monitoring and profiling".

alfredopacino
  • 2,979
  • 9
  • 42
  • 68
  • Note that there is a syntax error in the `for` clause: you start two predicates but only end one of them. – Joe Wicentowski Jul 14 '16 at 15:59
  • Thanks for adding your finding about using the "New" button in eXide rather than the "New XQuery" button for XML documents like `collection.xconf`. This ensures eXide recognizes the .xconf file as an XML document (not an XQuery!) and offers to copy it to the appropriate location in `/db/system`, and allows this index to be applied to your collection (note that eXide automatically performs the reindex for you; you don't have to do it again). I've filed a feature request in eXide to warn users against doing the wrong thing: https://github.com/wolfgangmm/eXide/issues/126. – Joe Wicentowski Jul 18 '16 at 15:42

1 Answers1

2

As that documentation page states, eXist does create a structural index for all XML stored in the database. This is not an index of values, though, so without further indexes, queries based on value (rather than structure) would involve a lookup of values in the DOM. As your data grows larger, looking up values in the DOM gets slower and slower. This is where value-based indexes, such a range index, saves the day. (For a fuller explanation, see the "Indexing" section of Wolfgang Meier's "Tuning the Database" article, which is essential for getting the most performance out of eXist.)

So, yes, you can create indexes for the <number> and <date> fields. I'd recommend the "new range" index, as described on that documentation page. Your collection.xconf file setting up these indexes would look like this:

<collection xmlns="http://exist-db.org/collection-config/1.0" 
            xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <index>
        <range>
            <create qname="number" type="xs:integer"/>
            <create qname="date" type="xs:string"/>
        </range>
    </index>
</collection>

You have to store this within the /db/system/config/ collection, in a subcollection corresponding to the location of your data in the database. So if your data is located in /db/apps/myapp/data, you would place this collection.xconf file in /db/system/config/db/apps/myapp/data.

Note that the configuration here would only affect the for clause's queries of date and number values, and not the predicates in the return clause, which depend on the values of <year> and <type> elements. So, to ensure your query maximized the use of indexes, you should declare indexes on these; it seems that xs:integer would be the appropriate type for each.

Lastly, I would suggest eliminating the /text() steps, which are completely extraneous. For more on the use/abuse of text(), see Evan Lenz's article, "text() is a code smell".

Update (2016-07-17): With the updated code sample above, I have a couple of additional suggestions. First, since the code is in /db/index_test, we will store our files as follows:

Contents of <code>/db/index_test</code> collection, as seen in eXide

Assuming you're using eXide, when you store the collection.xconf file in a collection, eXide will prompt you to have a copy of the file placed in the correct location in /db/system/config. If you're not using eXide, you need to store the collection.xconf file there yourself.

Using the unmodified query, I can confirm that despite the presence of the collection.xconf file, monex shows no indexes are being applied:

monex showing no indexes applied

Let's make a few modifications to the file to ensure indexes are properly applied:

xquery version "3.0";

<root> {
    for $a in doc("/db/index_test/test_general.xml")//detail[date = "03/02/1993" and number = 29]/ancestor::doc/header
    return 
        doc("/db/index_test/test_general.xml")/db/full_text_docs/doc
            [
                header/year = $a/year and
                header/number = $a/number and
                header/type = $a/type
            ]

} </root>

With these modifications, monex shows that indexes are applied to the comparisons in the for clause:

monex showing images applied

The insights here are derived from the "Tuning the Database" article. To get full indexing for all comparisons, you will need to define additional indexes and may need to make similar modifications to your query.

One final note: the version of monex you see in these pictures is using a feature I added this weekend, called "Tare", which tries to filter out other operations from the query profiling results in order to help the user see just the effects of their own query. This feature is still just a pull request, so running the current release version, you won't see identical results.

Joe Wicentowski
  • 5,159
  • 16
  • 26
  • I've done the collection.xconf step, I even tried the reindex with `xmldb:reindex('/db/mydir/mydb.xml')`. But it returns "false" it's like the reindexing doesn't work. Also I have multiple tags in different positions, I want to index just `doc/metas/meta/details/detail/number` – alfredopacino Jul 14 '16 at 13:02
  • 1
    Ah, `xmldb:reindex()` takes a collection URI, not a document URI. If the collection exists, it will return `true()`. To confirm the index is in operation with your query, you can use the query profiler in the monex (Monitoring and Profiling) app. – Joe Wicentowski Jul 14 '16 at 13:21
  • In Index usage tab in Monex I can't see a clear confirm of my index correct setup :/ – alfredopacino Jul 14 '16 at 16:47
  • If you haven't already, I suggest saving your query to, say, /db/test.xq. Then the query will be clearly labeled in the query profiler. I assume you've seen the directions for accessing the query profiler at http://exist-db.org/exist/apps/doc/indexing.xml#D3.2.4. – Joe Wicentowski Jul 15 '16 at 00:32
  • the fact is I'm not sure about the collection.xconf correctness. If it's like joewiz (and exist-db docs) suggest, where is the exact path of the field to index? I repeat: as you can see in my db I have multiple tags and I have to index just 1 of those. Here you can see a "match" attribute usage for the path https://youtu.be/SmOXLTL_wnw?t=15m19s I'm very confuse – alfredopacino Jul 17 '16 at 01:25
  • Let's set the indexes question aside for a minute. Can you update your post with working sample code? – Joe Wicentowski Jul 17 '16 at 01:28
  • I edited the question with a working example. thanks – alfredopacino Jul 17 '16 at 18:12
  • Great, thanks. I've updated my answer with a fuller explanation of how to get indexes working on your query. Please let me know if you have any questions. – Joe Wicentowski Jul 18 '16 at 02:56
  • very thanks for the explanation. But still, I'm confused. (1)The initial configuration.xconf was right (without any defined path for number and date)? (2) The range index could be used in a normal xquery? I thought it should be used in the form `range:eq()`. Anyway I tried your solution I still see "No index" in all profiling entries. – alfredopacino Jul 18 '16 at 10:34
  • (1) Yes, the index I've set up is on all number and date elements. I personally avoid trying to limit indexes by path. (2) Yes, any comparison operator can be used: =, <, >, etc. – Joe Wicentowski Jul 18 '16 at 11:16
  • I think I've shown a working answer to your original question, so for troubleshooting, let's take the discussion to the exist-open mailing list. I'm sure you've just got a small detail wrong, and we can get to the root of it on the mailing list. https://lists.sourceforge.net/lists/listinfo/exist-open. – Joe Wicentowski Jul 18 '16 at 11:24
  • Ok, I'm sure it's my fault, but somehow your solution doesn't work :/. The steps I've done: (1) I defined configuration.xconf in /db/system/config/db/index_test/test_general/ (NOTE: test_general without .xml extension) (2) I copied the same configuration.xconf in /db/index_test/ (3) I runned `xmldb:reindex('/db/index_test')` (4) I runned a xquery checking the profiling monitor..no trace of my index – alfredopacino Jul 18 '16 at 11:27
  • Ok, in (1) why do you place the xconf file in a test_general subdirectory that hasn't been mentioned in your post or my directions? Just directly in index_test please, to keep things simple. In (2) it sounds like you have two different xconf files. Just have one, in index_test and either let eXide copy it into /db/system for you or copy it yourself, but be sure to get the path right, as I suggested. Failing this, I would suggest deleting your existing files & collections. Start afresh with the latest version of the files from this page, and you should get identical results to mine. – Joe Wicentowski Jul 18 '16 at 11:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/117609/discussion-between-joewiz-and-alfredopacino). – Joe Wicentowski Jul 18 '16 at 13:06
  • 1
    I edited the question with my solution, but see this http://www.awesomescreenshot.com/image/1405127/49004ef53de3af21e416a7a075fa4f55 . Some fields involved are still unindexed. The result is after some benchmark some query takes more than 10s. I already indexed all the fields involved in the query :/ – alfredopacino Jul 18 '16 at 14:54
  • Great to hear that you've got the 2 indexes in the original question working! As to the "some fields... still unindexed", yes, as I said, you will need to define indexes on those elements too: `` and ``. And you may need to rework the queries a bit to get the indexes to work. Both issues are beyond the scope of the original question (which I think we've thoroughly exhausted), so for further questions, please open up a thread on exist-open on tuning your query to ensure index usage is maximized. – Joe Wicentowski Jul 18 '16 at 15:36
  • Anyway the topic is to improve the performances of the query :) I thought it was enough to index the 2 fields in input. But now I know it isn't. A complete solution isn't offtopic after all and could be very useful to the users. (we can clean up this thread from those comments if you want) – alfredopacino Jul 18 '16 at 23:15