2

I do have a xml file (at about 3gb) containing 150k entrys. sample entry:

<entry>
.... lots of data here ....
<customer-id>1</customer-id>
</entry>

Each of theese entrys do have a specific customer-id. I have to filter the dataset based on a blacklist (sequence of 3k ids) f.e

let $blacklist-customers := ('0',
'1',
'2',
'3',
....
'3000')

I currently do the check whether or not the customer-id from each entry is included within the blacklist like this:

for $entry in //entry
    let $customer-id:= $entry//customer-id
    let $inblacklist := $blacklist = //$customer-id
return if (not($inblacklist)) then $entry else ()

If it is not included, it will be returned.

Following this approach, after at about 2 minutes of processing I do get an out of main memory error.

I tried to adjust the code so that I group first and only ask for each group whether or not it is included in the blacklist. But I still do get an out of main memory error that way.

for $entry in //entry
    let $customer-id:= $entry//customer-id
    group by $customer-id
    let $inblacklist := $blacklist = //$customer-id
return if (not($inblacklist)) then $entry else ()

The processing takes place in basex. What are the reasons for the out of main memory error and what is the best approach to solve this problem? Also does grouping the data reduce the amount of iterations needed if I follow the second approach or not?

Snackaholic
  • 590
  • 7
  • 27
  • It seems simply doing `//entry[not(customer-id = $blacklist)]` gives a much simple expression. As for the out of memory, is the file in an BaseX database? How are you running the query (command line, GUI), where/how are you trying to save or display the query result? – Martin Honnen Dec 13 '19 at 12:50
  • Im running the programm via gui. the file itself is loaded into the programm, there i execute the query. Im not saving the result as i would with the programm since it stops with the exception. – Snackaholic Dec 13 '19 at 13:38
  • Just a general comment, your code can be easier to write and understand if you leverage the `WHERE` clause in your FLWOR, instead of if/else in your return: `return if (not($inblacklist)) then $entry else ()` could be written as `where not($inblacklist) return $entry` – Mads Hansen Dec 13 '19 at 14:01
  • Also, have you tried moving the blacklist filter into a predicate when selecting the entry? `for $entry in //entry[not(.//customer-id = $blacklist)]` That might limit the set of entries to group by, thus reducing memory demands – Mads Hansen Dec 13 '19 at 14:06
  • I am not sure how well the BaseX GUI handles the task of displaying 147k entries, does a selection of all entries with `//entry` not run out of memory? – Martin Honnen Dec 13 '19 at 14:07
  • martin:i did not try the predicate yet, it might work. no the result shown is cut at about 1000 entries when i do load the database into basex. but it gives information about how many results there are. i can try to execute it via commandline but i will ahve to check the manual for that mads: thank you for that suggestion.i did not try your described procedure yet, i will have to return to this task later – Snackaholic Dec 13 '19 at 14:52
  • the file itself will be loaded into the programm once you start it you can initialize the database with it – Snackaholic Dec 13 '19 at 14:57
  • should i apply //entry[not(customer-id = $blacklist)] before grouping or shall i not group at all? is there any positive effect behind it? – Snackaholic Dec 16 '19 at 07:17
  • Following the approach without for loop over all entrys takes about 12 minutes of processing and works! tyvm the query itself gets optimized like this: db:open-pre("mydatabase.xml", 0)/*:entrys/*:entry[not((lower-case(*:customer-id) = ("123","1234","12345"...)))]/*:customer-id – Snackaholic Dec 16 '19 at 08:15
  • I don't see why grouping would help. I would make sure I set up an index for the `customer-id` http://docs.basex.org/wiki/Index#Text_Index – Martin Honnen Dec 16 '19 at 09:41

0 Answers0