Ignored XML elements show up near eXist-db's lucene search results

Question

I'm building an application with eXist-db which works with TEI files and transforms them into HTML.

For the search function I configured Lucene to ignore some of the tags.

<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:teins="http://www.tei-c.org/ns/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">

        <fulltext default="none" attributes="false"/>

        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text match="//teins:TEI">

                <inline qname="p"/>
                <inline qname="text"/>

                <ignore qname="teins:del"/>
                <ignore qname="teins:sic"/>
                <ignore qname="teins:index"/>
                <ignore qname="teins:term"/>
                <ignore qname="teins:note"/>

            </text>
        </lucene>

    </index>
</collection>

This mostly works: the ignored elements no longer produce hits themselves, but their text still shows up in the snippets before and after the matched text that the kwic module returns. Is there a way to remove them, or to apply an XSL transformation before indexing?

Example TEI:

...daß er sie zu entwerten sucht. Wie 
                   <index>
                        <term>Liebe</term>
                        <index>
                            <term>und Hass</term>
                        </index>
                    </index>
Liebe Ausströmung inneren Wertes ist,...

When I search for "Ausströmung", the query returns

 ....sucht. Wie Liebe und Hass Liebe    Ausströmung     inneren Wertes ist,...

But it should return

 ....sucht. Wie Liebe   Ausströmung     inneren Wertes ist,...

When I search for "Hass", this text snippet does not show up in the results.

For the search functions I'm sticking closely to the Shakespeare example in the documentation.

Jens Østergaard Petersen · Accepted Answer · 2014-01-19T10:45:33.763

Let's take eXist-db's Shakespeare app as the point of departure. Say you have index entries there. You do not want hits in the index terms (the index configuration takes care of this), but you also do not want them output to the KWIC display; this you have to code yourself.

If you look in app.xql, you will see there is a function named app:filter called from app:show-hits. This you can use to remove parts of the output to the KWIC display, based on the name of the parent of the text node that is output.

This will give what you want:

declare %private function app:filter($node as node(), $mode as xs:string) as xs:string? {
    let $ignored-elements := doc('/db/system/config/db/apps/shakespeare/collection.xconf')//*:ignore/@qname/string()
    let $ignored-elements := 
        for $ignored-element in $ignored-elements
        let $ignored-element := substring-after($ignored-element, ':')
        return $ignored-element
    return
        if (local-name($node/parent::*) = ('speaker', 'stage', 'head', $ignored-elements)) 
        then ()
        else 
            if ($mode eq 'before') 
            then concat($node, ' ')
            else concat(' ', $node)
};

You can of course hard-code the elements to ignore, as in ('speaker', 'stage', 'head', 'sic', 'term', 'note') ('index' is not needed here since you must always use 'term' inside it), but I wanted to show that you do not have to. However, if you do not hard-code the elements to ignore, you should certainly move the assignment of $ignored-elements out of the function, for instance to a variable declared in the query prolog, so that collection.xconf is not read from the database for every text node you encounter. Doing it inside the function really is stupid, but I have put it all in one function for the sake of simplicity.
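Moving the assignment into the prolog could look roughly like the sketch below. It is written as a standalone main module, so it uses the local: prefix and a hypothetical helper local:ignored-names that is not part of the Shakespeare app; the collection.xconf path is the one used above and is an assumption about where your app is installed.

```xquery
xquery version "3.0";

declare namespace cc = "http://exist-db.org/collection-config/1.0";

(: Hypothetical helper: local names of all elements listed in <ignore qname="..."/>. :)
declare function local:ignored-names($xconf as node()?) as xs:string* {
    for $qname in $xconf//cc:ignore/@qname/string()
    return
        (: strip a namespace prefix such as "tei:" when there is one :)
        if (contains($qname, ':')) then substring-after($qname, ':') else $qname
};

(: Bound once per query in the prolog, not once per text node.
   The path is an assumption; adjust it to your own app. :)
declare variable $ignored-elements as xs:string* :=
    local:ignored-names(doc('/db/system/config/db/apps/shakespeare/collection.xconf'));

declare function local:filter($node as node(), $mode as xs:string) as xs:string? {
    if (local-name($node/parent::*) = ('speaker', 'stage', 'head', $ignored-elements))
    then ()
    else if ($mode eq 'before')
    then concat($node, ' ')
    else concat(' ', $node)
};

(: The text node's parent is <term>, so this yields the empty sequence
   whenever "term" is among the ignored names. :)
local:filter(<term>Liebe</term>/text(), 'before')
```

Inside app.xql you would keep the function name app:filter and declare the variable in the app namespace (for instance $app:ignored-elements) instead of using local:.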

PS: namespace prefixes can be anything you choose, but the standard namespace prefix for the http://www.tei-c.org/ns/1.0 namespace is "tei", and changing it to "teins" can only lead to confusion.

Thank you, this solved my problem. Currently I'm working on an installed version from last May, so the filter function looks a bit different. One last thing: is it possible to retrieve the '/db/system/config/db/apps/shakespeare/collection.xconf' in a dynamic way? If I move the application to another folder, the path will change too. I've changed this to doc(fn:concat('/db/system/config', $config:app-root, '/collection.xconf')) but that looks very messy and ugly. Is there a better solution to access collections relative to the application root? — romedius, Jan 18 '14 at 21:37
If you see that as messy and ugly, you had better start getting used to it - this is how a good app is built. I for one find it beautiful. - Would you please correct "Ignored XML attributes" in your question title to "Ignored XML elements"? - Do you declare and bind $ignored-elements in the query prolog? — Jens Østergaard Petersen, Jan 19 '14 at 10:32
