4

I've got a BaseX XML database with only one small XML file. These file basically consists of two structures. One is PlatformCategory with 46 instances, the other one PlatformGenericType with 213 instances.

PlatformGenericType has references to PlatformCategory in the href attribute.

<PlatformGeneralType id="/plib/platformgeneraltypes/pgt1">
  <name>No statement</name>
  <enum>NO_STATEMENT</enum>
  <isOfPlatformCategory href="/plib/platformcategories/pc1"/>
  <readOnly>true</readOnly>
</PlatformGeneralType>

<PlatformCategory id="/plib/platformcategories/pc1">
  <name>No statement</name>
  <enum>NO_STATEMENT</enum>
  <environment>AIR</environment>
  <readOnly>true</readOnly>
</PlatformCategory>

When I perform following query, it takes about six seconds to get the result:

//PlatformGeneralType[isOfPlatformCategory/@href=//PlatformCategory[environment="AIR"]/@id]

What can I do to optimize this query?

Note that I ran "optimize all".

Update: The issue on the previous query seems to be solved. But when I extends the query with following, the query takes 44,28 seconds:

/PLib/PlatformSpecificTypes/PlatformSpecificType
[isOfPlatformGeneralType/@href=/PLib/PlatformGeneralTypes/PlatformGeneralType
    [isOfPlatformCategory/@href=/PLib/PlatformCategories/PlatformCategory
        [environment='AIR']/@id]/@id]

The there are 8939 instances of PlatformSpecificType and its structure:

<PlatformSpecificTypes>
    <PlatformSpecificType id="/plib/platformspecifictypes/DataShip.3">
        <name>Meko 360H2</name>
        <lethalityLevel>LOW</lethalityLevel>
        <isOfPlatformGeneralType href="/plib/platformgeneraltypes/pgt62"/>
        <ownedByCountry href="/plib/countries/10"/>
    </PlatformSpecificType>
</PlatformSpecificTypes>

Its query info:

Query: /PLib/PlatformSpecificTypes/PlatformSpecificType[isOfPlatformGeneralType/@href=/PLib/PlatformGeneralTypes/PlatformGeneralType[isOfPlatformCategory/@href=/PLib/PlatformCategories/PlatformCategory[environment='AIR']/@id]/@id] Result: - Hit(s): 3642 Items - Updated: 0 Items - Printed: 2048 KB - Read Locking: local [command_plib] - Write Locking: none Timing: - Parsing: 1.25 ms - Compiling: 0.71 ms - Evaluating: 44248.94 ms - Printing: 37.11 ms - Total Time: 44288.02 ms Query plan:

Database properties:

Database Properties
 Name: command_plib
 Size: 20247 KB
 Nodes: 781606
 Documents: 1
 Binaries: 0
 Timestamp: 2015-06-12-10-12-14

Resource Properties
 Input Path: /home/sceran/Documents/PLIB/command_plib.xml
 Input Size: 21354 KB
 Timestamp: 2015-06-11-15-34-07
 Encoding: UTF-8
 CHOP: true

Indexes
 Up-to-date: true
 TEXTINDEX: true
 ATTRINDEX: true
 FTINDEX: false
 LANGUAGE: English
 STEMMING: true
 CASESENS: true
 DIACRITICS: false
 STOPWORDS: 
 UPDINDEX: false
 AUTOOPTIMIZE: false
 MAXCATS: 100
 MAXLEN: 96

Query info:

Compiling:
- rewriting descendant-or-self step(s)
- rewriting descendant-or-self step(s)
- converting descendant::*:PlatformGeneralType[(*:isOfPlatformCategory/@*:href = root()/descendant::*:PlatformCategory[(*:environment = "AIR")]/@*:id)] to child steps
Query:
//PlatformGeneralType[isOfPlatformCategory/@href=//PlatformCategory[environment="AIR"]/@id]
Optimized Query:
db:open-pre("command_plib",0)/*:PLib/*:PlatformGeneralTypes/*:PlatformGeneralType[(*:isOfPlatformCategory/@*:href = root()/descendant::*:PlatformCategory[(*:environment = "AIR")]/@*:id)]
Result:
- Hit(s): 55 Items
- Updated: 0 Items
- Printed: 12776 Bytes
- Read Locking: local [command_plib]
- Write Locking: none
Timing:
- Parsing: 0.55 ms
- Compiling: 0.3 ms
- Evaluating: 5786.29 ms
- Printing: 1.0 ms
- Total Time: 5788.15 ms
Query plan:
<QueryPlan compiled="true">
  <IterPath>
    <DBNode name="command_plib" pre="0"/>
    <IterStep axis="child" test="*:PLib"/>
    <IterStep axis="child" test="*:PlatformGeneralTypes"/>
    <IterStep axis="child" test="*:PlatformGeneralType">
      <CmpG op="=">
        <CachedPath>
          <IterStep axis="child" test="*:isOfPlatformCategory"/>
          <IterStep axis="attribute" test="*:href"/>
        </CachedPath>
        <IterPath>
          <Root/>
          <IterStep axis="descendant" test="*:PlatformCategory">
            <CmpG op="=">
              <CachedPath>
                <IterStep axis="child" test="*:environment"/>
              </CachedPath>
              <Str value="AIR" type="xs:string"/>
            </CmpG>
          </IterStep>
          <IterStep axis="attribute" test="*:id"/>
        </IterPath>
      </CmpG>
    </IterStep>
  </IterPath>
</QueryPlan>

Update two: I suspect the structure of PlatformSpecificTypes prevents indexing. I am wondering if I change it as below, will it improves the query performance?

<PlatformSpecificTypes>
    <PlatformSpecificType id="/plib/platformspecifictypes/DataShip.3">
        <name>Meko 360H2</name>
        <lethalityLevel>LOW</lethalityLevel>
        **<isOfPlatformGeneralType>/plib/platformgeneraltypes/pgt62 </isOfPlatformGeneralType>**
        <ownedByCountry href="/plib/countries/10"/>
    </PlatformSpecificType>
</PlatformSpecificTypes>

Update three: I have uploaded the XML file in a gist, so that you can inspect it.

Now, when I execute following query, I takes about 28 seconds to get the result.

/root/PlSpTys/PlSpTy[isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[isOfPlCt/@href=/root/PlCts/PlCt[environment='AIR']/@id]/@id]

Here is the query info:

 Query:
/root/PlSpTys/PlSpTy[isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[isOfPlCt/@href=/root/PlCts/PlCt[environment='AIR']/@id]/@id]
Result:
- Hit(s): 3642 Items
- Updated: 0 Items
- Printed: 257 KB
- Read Locking: local [Output6]
- Write Locking: none
Timing:
- Parsing: 0.66 ms
- Compiling: 0.34 ms
- Evaluating: 28398.32 ms
- Printing: 4.63 ms
- Total Time: 28403.97 ms
Query plan:
<QueryPlan compiled="true">
  <IterPath>
    <DBNode name="Output6" pre="0"/>
    <IterStep axis="child" test="*:root"/>
    <IterStep axis="child" test="*:PlSpTys"/>
    <IterStep axis="child" test="*:PlSpTy">
      <CmpG op="=">
        <CachedPath>
          <IterStep axis="child" test="*:isOfPlGeTy"/>
          <IterStep axis="attribute" test="*:href"/>
        </CachedPath>
        <IterPath>
          <Root/>
          <IterStep axis="child" test="*:root"/>
          <IterStep axis="child" test="*:PlGeTys"/>
          <IterStep axis="child" test="*:PlGeTy">
            <CmpG op="=">
              <CachedPath>
                <IterStep axis="child" test="*:isOfPlCt"/>
                <IterStep axis="attribute" test="*:href"/>
              </CachedPath>
              <IterPath>
                <Root/>
                <IterStep axis="child" test="*:root"/>
                <IterStep axis="child" test="*:PlCts"/>
                <IterStep axis="child" test="*:PlCt">
                  <CmpG op="=">
                    <CachedPath>
                      <IterStep axis="child" test="*:environment"/>
                    </CachedPath>
                    <Str value="AIR" type="xs:string"/>
                  </CmpG>
                </IterStep>
                <IterStep axis="attribute" test="*:id"/>
              </IterPath>
            </CmpG>
          </IterStep>
          <IterStep axis="attribute" test="*:id"/>
        </IterPath>
      </CmpG>
    </IterStep>
  </IterPath>
</QueryPlan>

Could you help me on optimizing the query duration?

Jens Erat
  • 37,523
  • 16
  • 80
  • 96
  • This seems like a rather small dataset. Can you provide a full example, for example in a [gist](https://gist.github.com/) (maybe by anonymizing critical information, if there are any)? – Jens Erat Jun 12 '15 at 08:31
  • Actually, in the XML file, there are other structures but not related to the structures I mentioned above (PlatformGenericType, PlatformCategory). In order to provide the example you want, I have removed the other structures from XML file. But when I execute the query, I took only 50.5 msec. Now, I am wondering why those unrelated structures slows down the query? – user1587140 Jun 12 '15 at 08:57
  • So there's more data inside that XML file? How many results are there for `//PlatformCategory[environment="AIR"]/@id`? How long does this part of the query evaluate? Try replacing the `//` with an absolute path; it seems BaseX cannot optimize to child steps here for some reason. – Jens Erat Jun 12 '15 at 09:17
  • Are you sure this is doing what you expected? I have the feeling a `/` is missing after the `=` in front of the `PLib` (or is it just missing here?). – Jens Erat Jun 12 '15 at 09:33
  • I takes 25.61 ms to get result. I am very new in BaseX. Could you explain how to get the absolute path? – user1587140 Jun 12 '15 at 09:54
  • when I try following, Unfortunatelly, it still takes 5.6 seconds. //PlatformGeneralType[isOfPlatformCategory/@href=/PLib//PlatformCategory[environment='AIR']/@id] – user1587140 Jun 12 '15 at 09:56
  • An absolute path starts at the root node `/`, and only uses child steps (no descended-or-self steps `//`). `/PLib/PlatformGeneralTypes/PlatformGeneralType` would be an absolute path for the first part of the query (instead of `//PlatformGeneralType). `//` is pretty much always more expensive to evaluate, especially for large subtrees. – Jens Erat Jun 12 '15 at 09:58
  • After applying your solution, it takes 13.41 ms now. But when I extend the query as I mention above, it now takes 44.28 seconds. Do you think it is normal with this data structure? – user1587140 Jun 12 '15 at 11:37
  • I don't see how I could give further analysis without any data, the information given is much to incomplete. – Jens Erat Jun 12 '15 at 11:39
  • Unfortunately, I am not allowed to share the XML here due to license restrictions. – user1587140 Jun 12 '15 at 11:59
  • But the general structure is; PlatformSpecificType(8939) uses PlatformGeneralType(213) uses PlatformCategory(46). When we start query with environment "AIR", 9 items PlatformCategory in 8 ms, 55 items PlatformGeneralType in 20 ms and 3642 items PlatformSpecificType in 44 seconds return. – user1587140 Jun 12 '15 at 12:13
  • Nobody else to solve this issue? – user1587140 Jun 14 '15 at 20:04
  • Don't expect anybody to give further guesswork to a very vague question. Maybe you can "anonymize" the data. – Jens Erat Jun 14 '15 at 20:35
  • @JensErat, I have put the test data. You may want to see Update 3. – user1587140 Jun 15 '15 at 09:03

2 Answers2

3

BaseX does not seem to realize it should preprocess the "inner" part with a static result, thus the evaluation costs are somewhere around O(n^2) instead of O(n).

Reformatting your query (which takes around 30 seconds on my machine) to get a better understanding of it shows up the whole right-hand-side of the comparison inside the first predicate being static, not depending on the PlSpTy element currently analyzed:

/root/PlSpTys/PlSpTy[
  isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[
    isOfPlCt/@href=/root/PlCts/PlCt[
      environment='AIR'
    ]/@id
  ]/@id
]

Evaluation of this takes around 9ms on my machine, which is not a lot, but might get expensive if ran repeatedly. Counting the number of PlSpTy elements (count(/root/PlSpTys/PlSpTy)) reveals close to 8939 such elements, thus the evaluation of the inner part costs around 8939*9ms ~= 80s -- something must have been optimized away, but not everything.

What happens if we simply pull out this part of the query and precompute it?

let $compare :=
  /root/PlGeTys/PlGeTy[
    isOfPlCt/@href=/root/PlCts/PlCt[
      environment='AIR'
    ]/@id
  ]/@id

return
  /root/PlSpTys/PlSpTy[
    isOfPlGeTy/@href=$compare
  ]

Calculation time drops to 16ms, of which a fourth was used to actually print the results. I opened a bug report requesting better optimization. (Update: some optimizations have been applied).

Jens Erat
  • 37,523
  • 16
  • 80
  • 96
  • I have also tested your solution with the real data. It seems you solved the problem. Thank you for all your help;) – user1587140 Jun 15 '15 at 11:22
  • could you please help me on this issue (http://stackoverflow.com/questions/30865291/basex-query-optimization-on-join)? – user1587140 Jun 16 '15 at 11:04
  • Some optimizations for this issue [have been included in BaseX, and should be included in the next snapshot soon](https://github.com/BaseXdb/basex/issues/1160#issuecomment-113525690). – Jens Erat Jun 19 '15 at 16:57
1

The latest snapshot of BaseX provides an optimization for your query: Results of a path expressions will now be cached if they are requested more than once. The optimization will be officially available with version 8.2.2.

Christian Grün
  • 6,012
  • 18
  • 34