0

Right now the query takes around 2 minutes , before I made some changes it took 3:48m.

The xml documents are taken from a web pages , due to it changes each 5m and gives information about the buses in real time.

Could you help me to optimize this query?

xquery version "3.0";
declare namespace bus="http://docs.gijon.es/sw/busgijon.asmx";

declare function local:getNombreParada($numero)
{
    for $parada in doc("http://datos.gijon.es/doc/transporte/busgijoninfo.xml")//paradas/bus:parada
    where $numero=$parada/bus:idparada
    return $parada/bus:descripcion
};

declare function local:getBusesPorLinea($linea)
{

    let $numero:=$linea
    let $nBuses:=count(doc("http://datos.gijon.es/doc/transporte/busgijontr.xml")//bus:llegada[bus:idlinea=$numero])

    return 
    if($nBuses=0)
    then(<p>No hay ningun bus en esta linea</p>)
    else(
    <div>
        <h2>Numero de buses funcionando en la linea {$numero} : {$nBuses}</h2>

    <table class="table table-hover">
        <thead>
          <tr>
            <th>Parada</th>
            <th>Minutos hasta la llegada</th>
          </tr>
        </thead>
        <tbody>
            {
            for $l in doc("http://datos.gijon.es/doc/transporte/busgijontr.xml")//bus:llegada[bus:idlinea=$numero]
                for $parada in doc("http://datos.gijon.es/doc/transporte/busgijoninfo.xml")//paradas/bus:parada[bus:idparada=$l/bus:idparada]


            return <tr>
                        <td>{$parada/bus:descripcion}</td>
                        <td>{$l/bus:minutos}</td></tr>
            }
        </tbody>
    </table>

    </div>
    )


};

local:getBusesPorLinea(1)

PD:i am running this in exist Db

  • Just a side note, what Michael Kay said is still most relevant: Try to avoid `//` if you are having performance issues. This will always have to trigger a complete scan of all descendants and a concrete lookup (i.e. with the full path) will certainly be faster. – dirkk Jan 05 '16 at 08:24
  • 1
    @dirkk Actually that is not true for eXist. Rather the inverse is true, i.e. `//` can be much faster than using a full path due to the way that indexing works; That is assuming of course you have created the indexes first ;-) – adamretter Jan 05 '16 at 13:46
  • @adamretter Interesting, I didn't know this. I am also surprised, because a concrete path has always more information than a simple descendant-or-self operator, i.e. the optimizer could (in theory) rewrite the specific path to `//` whereas this is not possible the other way round. But I will refrain from this statement in the future for general XQuery processor, seems like this only holds true for BaseX. – dirkk Jan 05 '16 at 15:20
  • yes @adamretter is right I was also quite missed when I read that in the documentation of existDB , i post here the link in which you can find more tricks with existDB [tuning exist DB](http://exist-db.org/exist/apps/doc/tuning.xml) – Roberto Fernandez Diaz Jan 08 '16 at 23:51

4 Answers4

2

Without smart optimization, this join expression:

for $l in doc("a.xml")//bus:llegada[bus:idlinea=$numero]
  for $parada in doc("b.xml")//paradas/bus:parada[bus:idparada=$l/bus:idparada]
return <tr>...</tr>

is going to have quadratic performance. You haven't told us anything about the document sizes, but that is where I would start looking.

The way you deal with such an issue in an XML database environment is typically by creating appropriate indexes.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
2

First of all, the best way to optimize queries in eXist is to store the XML locally and with indexes. Please use the built in documentation for how to setup indexes.

However, your code suffers from fetching the same data over and over from the network. Let's take care of that and the other issue, your use of in-memory XML querying, another optimization bottleneck.

The most important first step is to have the XML you're querying in the local database. Queries of nodes in the database are faster and use less memory than queries against in-memory XML nodes. (At least, that was that case with versions up to 2.2.)

So, here is a method to cache the data locally, refreshing the cache after the latest update is more than 5 minutes old.

xquery version "3.0";

declare namespace bus="http://docs.gijon.es/sw/busgijon.asmx";

(: Store the XML data in the collection /db/busgijon/data :)
declare variable $COL := "/db/busgijon/data";
declare variable $INFO-FILE := "busgijoninfo.xml";
declare variable $TR-FILE := "busgijontr.xml";

(: Fetch a page from cache or from web site, updating the cache :)
declare function local:fetchPage($filename) {
    (: If the page was fetched more than 5 minutes ago, refresh it :)
    let $expire := current-dateTime() - xs:dayTimeDuration('PT5M')
    let $page := doc($COL || "/" || $filename)/page
    return if (exists($page))
        then if ($page/xs:dateTime(@timestamp) ge $expire)
            then $page
            else (update replace $page/* with doc("http://datos.gijon.es/doc/transporte/" || $filename)/*
                , update value $page/@timestamp with current-dateTime()
                , $page)
        else doc(xmldb:store($COL, $filename, <page timestamp="{current-dateTime()}">{doc("http://datos.gijon.es/doc/transporte/" || $filename)/*}</page>))/page
};

declare function local:getBusesPorLinea($linea)
{
    (: Get the two pages from the database cache for querying :)
    let $info := local:fetchPage($INFO-FILE)/bus:BusGijonInfo
    let $tr := local:fetchPage($TR-FILE)/bus:BusGijonTr

    let $numero:=$linea
    let $nBuses:=count($tr//bus:llegada[bus:idlinea=$numero])

    return 
    if($nBuses=0)
    then(<p>No hay ningun bus en esta linea</p>)
    else(
    <div>
        <h2>Numero de buses funcionando en la linea {$numero} : {$nBuses}</h2>

    <table class="table table-hover">
        <thead>
          <tr>
            <th>Parada</th>
            <th>Minutos hasta la llegada</th>
          </tr>
        </thead>
        <tbody>
            {
            (: Loop through the TR page - fetched just once from cache :)
            for $l in $tr//bus:llegada[bus:idlinea=$numero]
                (: Loop through the Info page - fetched just once from cache :)
                for $parada in $info//paradas/bus:parada[bus:idparada=$l/bus:idparada]


            return <tr>
                        <td>{$parada/bus:descripcion}</td>
                        <td>{$l/bus:minutos}</td></tr>
            }
        </tbody>
    </table>

    </div>
    )


};

local:getBusesPorLinea(1)

The only part that I changed in the local:getBusesPorLinea function was fetching the two documents at the top from the cache and using those in the embedded loops.

The local:fetchPage function is where most of the speedup occurs. Here's what it does:

  • Set expiration time to 5 minutes in the past.
  • Try to fetch the specified page from the cache.
  • If the page exists, compare the fetched timestamp to the expiration timestamp.
  • If the page's timestamp is less than 5 minutes ago (greater than the expiration timestamp), return that page.
  • If the page's timestamp is greater than 5 minutes ago, refetch it, updating the contents of the page with the refreshed document, update the page's timestamp, and return the new page.
  • If the page does not yet exist, save the page to the specified collection with the current timestamp, returning the page element.

The first person to access this XQuery after 5 minutes have passed will have about an extra 5-10 seconds as the cache gets refreshed. This allows the cache to be passive, so you don't have to manually refresh it every five minutes.

Hope this helps.

westbaystars
  • 151
  • 1
  • 4
  • Thanks i add this to the other solution that CiaPan suggest and now it takes like 2 seconds (using CiaPan suggestion it takes 4s and before that it takes like 2 minutes, I don't really know why the difference is so big) – Roberto Fernandez Diaz Jan 05 '16 at 13:40
  • If you add typed indexes in eXist-db and adjust your predicates you should be able to get this query down to 10's of ms. – adamretter Jan 05 '16 at 14:08
  • 1
    @RobertoFernandez In the `for $l` you request loading the `busgijoninfo.xml` file to be scanned for `bus:parada` elements. If the documents are not cached, then you fetch the file as many times as there is `bus:llegada` elements in the `busgijontr.xml` file satisfying the `[bus:idlinea=$numero]` predicate condition. That means about 30 file loads for `$numero=4`, 60 for `$numero=1` or 65 for `$numero=20`. That's possibly why a single fetch reduces the running time to approx. 1/30 of the previous value . – CiaPan Jan 06 '16 at 09:23
1

Are documents cached? I'm not an expert, but your code seems to access the same document multiple times. That's OK if you're sure the content is cached in the executing environment. Otherwise I would try to declare

declare variable $docinfo := doc("http://datos.gijon.es/doc/transporte/busgijoninfo.xml");
declare variable $doctr   := doc("http://datos.gijon.es/doc/transporte/busgijontr.xml");

to make sure the files are fetched just once.

You also scan documents at least twice for same kind of data. I would do that once:

declare variable $paradas  := $docinfo//paradas;
declare variable $llegadas := $doctr//bus:llegada;

then only filter the collections:

declare function local:getNombreParada($numero)
{
    $paradas/bus:parada[bus:idparada = $numero]/bus:descripcion
};

declare function local:getBusesPorLinea($linea)
{
    let $numero:=$linea
    let $llegadasNum:=$llegadas[bus:idlinea=$numero]
    let $nBuses:=count($llegadasNum)

    return 

    if($nBuses=0)
    then(<p>No hay ningun bus en esta linea</p>)
    else(
    <div>
        <h2>Numero de buses funcionando en la linea {$numero} : {$nBuses}</h2>

    <table class="table table-hover">
        <thead>
          <tr>
            <th>Parada</th>
            <th>Minutos hasta la llegada</th>
          </tr>
        </thead>
        <tbody>
            {
            for $l in $llegadasNum
                for $parada in $paradas/bus:parada[bus:idparada=$l/bus:idparada]
                return <tr>
                        <td>{$parada/bus:descripcion}</td>
                        <td>{$l/bus:minutos}</td></tr>
            }
        </tbody>
    </table>

    </div>
    )
};

May be that is not much faster, but I hope it is a bit more readable.

CiaPan
  • 9,381
  • 2
  • 21
  • 35
1

Another tip: for queries in eXist-db, it is better to avoid to use the where clause. An XPath predicate typically performs much better.

Quite a few tips are on http://exist-db.org/exist/apps/doc/tuning.xml?q=performance&field=all&id=D2.2.2#D2.2.6

DiZzZz
  • 621
  • 3
  • 12