I have an XML file with a number of <entry> elements in it (see below). I would like to extract most of the information contained in each <entry> element and put it into an (X)HTML document.

I'm able to perform a search and retrieve the element contents I want. If I search for the term ἄγγελος either in entry/hyperlemma/orth (path A) or in cit/hyperlemma/orth (path B), it is found once in entry01 via path A and twice in entry02 via path B.

The idea is to print the content of each entry element in which ἄγγελος was found, regardless of the number of occurrences. As the term was found twice in entry02, that entry is (of course) printed twice, but I only need it once. Is that possible with XQuery? And if so, how would I do it?

My XML:

<text>
    <entry xml:id="01">
        <hyperlemma>ἄγγελος</hyperlemma>
        <lemma>ἄγγελος</lemma>
        <variant>τῶν ἀγγέλων
            <hyperlemma>
                <orth>ἄγγελος</orth>
            </hyperlemma>
        </variant>
    </entry>
    <entry xml:id="02">
        <hyperlemma>
            <orth>ангелъ</orth>
        </hyperlemma>
        <lemma>
            <orth>ангелъ</orth>
        </lemma>
        <variant>
            <orth>анг꙯ла</orth>
            <hyperlemma>
                <orth>ангелъ</orth>
            </hyperlemma>
            <cit>
                <hyperlemma>
                    <orth>ἄγγελος</orth>
                </hyperlemma>
                <lemma>
                    <orth>ἄγγελον</orth>
                </lemma>
            </cit>
        </variant>
        <variant>
            <orth>анг꙯лъ</orth>
            <hyperlemma>
                <orth>ангелъ</orth>
            </hyperlemma>
            <cit>
                <hyperlemma>
                    <orth>ἄγγελος</orth>
                </hyperlemma>
                <lemma>
                    <orth>ἄγγελος</orth>
                </lemma>
            </cit>
        </variant>
    </entry>
</text>

My XQuery:

xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "xml";
declare variable $searchphrase := "ἄγγελος";
<html>
    <head>
        <meta HTTP-EQUIV="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>
        <h1>Output of searchterm</h1>
        <p>You are looking for "<font color="red"><strong>{$searchphrase}</strong></font>"</p>
        {
        let $hyperlemmas := doc("sample_entry.xml")/(descendant::entry | descendant::cit)/hyperlemma/orth [contains(., $searchphrase)]
        return
        <p>{$searchphrase} was found {count($hyperlemmas)} times.</p>
        }
        {
        let $hyperlemmas := doc("sample_entry.xml")/(descendant::entry | descendant::cit)/hyperlemma/orth [contains(., $searchphrase)]
        for $hyperlemma in $hyperlemmas
        let $entry_id := $hyperlemma/ancestor::entry/@xml:id
        let $lemma := $hyperlemma/ancestor::entry/lemma/orth
        let $variant := $hyperlemma/ancestor::entry/variant/orth
        return
        <div>
            Entry {string($entry_id)}:<br/>
            Lemma: {$lemma} //
            {
            for $form in $variant
            return
            <i>{$form}</i>
            }
        </div>      
        }
    </body>
</html>
smo

2 Answers

As you're using XQuery 3.0, a quick solution would be to group by entry IDs, which you're resolving anyway:

(: snip :)
for $hyperlemma in $hyperlemmas
let $entry_id := $hyperlemma/ancestor::entry/@xml:id
group by $entry_id
let $lemma := $hyperlemma/ancestor::entry/lemma/orth
let $variant := $hyperlemma/ancestor::entry/variant/orth
return
  (: snip :)
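
For what it's worth, here is a minimal sketch of what `group by` does to the variables around it, assuming an XQuery 3.0 processor. The `<hit>` elements and their `entry` attribute are made-up stand-ins for the matched `orth` nodes, not part of the question's data.

```xquery
xquery version "3.0";
(: Hypothetical stand-alone illustration; the inline data is invented. :)
let $hits := (
  <hit entry="01">a</hit>,
  <hit entry="02">b</hit>,
  <hit entry="02">c</hit>
)
for $hit in $hits
let $id := string($hit/@entry)
(: One output tuple per distinct $id from here on. :)
group by $id
order by $id
(: After "group by", $hit is the sequence of all hits sharing this $id. :)
return <entry id="{$id}" hits="{count($hit)}"/>
(: Yields <entry id="01" hits="1"/> and <entry id="02" hits="2"/>. :)
```

Every variable bound before `group by`, other than the grouping key itself, is rebound to the sequence of all its values within the group. A `let` placed before the clause therefore ends up holding the concatenation of its per-match values, whereas a `let` placed after it is evaluated once per group, and a path expression starting from the grouped nodes returns each resulting node only once.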

A more elegant solution (although it amounts to an almost complete rewrite of the query) would be to loop over the entry elements instead and, for each of them, find the first match and print that entry.
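
Such a loop over the entry elements could look roughly like this. It is only a sketch: the file name, the two search paths and the output markup are taken from the question, and it keeps the question's `contains()` test rather than an exact comparison.

```xquery
xquery version "3.0";
declare variable $searchphrase := "ἄγγελος";
(: Sketch only: document name and paths are copied from the question. :)
for $entry in doc("sample_entry.xml")/text/entry
(: A non-empty node sequence counts as true, so an entry is kept
   as soon as at least one orth on either path contains the term. :)
where ($entry/hyperlemma/orth | $entry//cit/hyperlemma/orth)
        [contains(., $searchphrase)]
return
  <div>
    Entry {string($entry/@xml:id)}:<br/>
    Lemma: {$entry/lemma/orth} //
    {
      for $form in $entry/variant/orth
      return <i>{$form}</i>
    }
  </div>
```

Because the iteration runs over entries rather than over matches, each entry can be returned at most once, however many times the term occurs inside it.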

Jens Erat
  • Thanks a lot for your suggestions! In fact, `group by $entry_id` didn't work out, as I'm not sure what exactly happens when using the `group by` statement (and comprehensive documentation on that feature seems to be sparse). I had thought of looping over the `entry` elements myself before - to no avail. After sitting down and trying hard again yesterday, I finally came up with some code that does what I had in mind. I'll post it in another answer. I now have to test this with my much more complex XML data (stored in an eXist db), which involves using a Lucene index as well. Wish me luck! – smo May 22 '15 at 14:25
  • With BaseX and the example data you posted it worked out fine. – Jens Erat May 22 '15 at 14:28
  • I tested my code outside of eXist, using Oxygen Editor. Inserting `group by $entry_id` had to be done directly before the return statement, otherwise it would not run. – smo May 22 '15 at 15:32
  • While the number of entries was correct, the output of entry02 was like `Lemma: ангелъ ангелъ // анг꙯ла анг꙯лъ анг꙯ла анг꙯лъ `, so the output was somehow reduplicated. As I didn't find any satisfying documentation on how exactly the `group by` statement works, I tried figuring out looping over the `entry` elements - which in the end proved to be more successful. But thanks again for your help! – smo May 22 '15 at 15:41
I finally figured out myself how to print only certain elements inside the entry tag when a given search term can be found at different positions. Here is the rewritten XQuery code that (for now) works for me and gives the intended results:

xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method   "xml";
declare variable $searchphrase := "ἄγγελος";
<html>
    <head>
        <meta HTTP-EQUIV="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>
        <h1>Output of searchterm</h1>
        <p>You are looking for "<font color="red"><strong>{$searchphrase}</strong></font>"</p>
        {
        let $hyperlemmas := doc("sample_entry.xml")/(descendant::entry | descendant::cit)/hyperlemma/orth [contains(., $searchphrase)]
        let $ids := $hyperlemmas/ancestor::entry/@xml:id
        return
        <p>{$searchphrase} was found {count($hyperlemmas)} times. IDs: {data($ids)} </p>
        }
        {   
        let $entry_base := doc("sample_entry.xml")/text

        for $entry in $entry_base/entry
        let $id := $entry/@xml:id
        let $variant := $entry/variant/orth
        let $found_pos1 := $entry/hyperlemma/orth
        let $found_pos2 := $entry/descendant::cit/hyperlemma/orth
        where $found_pos1 = $searchphrase or $found_pos2 = $searchphrase
        return
        <div>ID {data($id)}:<br/>Lemma: {$entry/lemma/orth}<br/>
            {
            for $item in $variant
            return
            <div>Variant: {$item}
                {
                for $cit in $item/../cit/lemma
                where exists($cit)
                return
                <i>-> {$cit}</i>
                }
            </div>
            }
        </div>
        }
    </body>
</html>
smo