I have an XML file (about 3 GB) containing 150k entries. Sample entry:
<entry>
.... lots of data here ....
<customer-id>1</customer-id>
</entry>
Each of these entries has a specific customer-id. I have to filter the dataset against a blacklist (a sequence of 3k ids), e.g.:
let $blacklist-customers := ('0',
'1',
'2',
'3',
....
'3000')
I currently check whether the customer-id of each entry is contained in the blacklist like this:
for $entry in //entry
let $customer-id := $entry//customer-id
let $inblacklist := $blacklist-customers = $customer-id
return if (not($inblacklist)) then $entry else ()
If it is not included, the entry is returned.
With this approach, after about 2 minutes of processing I get an out of main memory error.
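For reference, I believe the same filter could also be written as a single predicate (a sketch, assuming customer-id occurs in each entry; the general comparison = returns true if any item on the left equals any item on the right):

//entry[not(.//customer-id = $blacklist-customers)]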
I tried to adjust the code so that I group first and test the blacklist only once per group, but I still get an out of main memory error that way:
for $entry in //entry
let $customer-id := $entry//customer-id
group by $customer-id
let $inblacklist := $blacklist-customers = $customer-id
return if (not($inblacklist)) then $entry else ()
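For comparison, I am aware the membership test itself could be sketched with an XQuery 3.1 map, which should give constant-time lookups instead of scanning the 3k-item sequence per entry (the map:merge construction and the assumption of exactly one customer-id per entry are mine, untested on the full dataset):

let $blacklist-map := map:merge($blacklist-customers ! map:entry(., true()))
for $entry in //entry
(: assumes exactly one customer-id element per entry :)
where not(map:contains($blacklist-map, string($entry//customer-id)))
return $entry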
The processing takes place in BaseX. What are the reasons for the out of main memory error, and what is the best approach to solve this problem? Also, does grouping the data reduce the number of iterations needed in the second approach, or not?