0

I'm using ArangoDB for a Web Application through Strongloop. I've got some performance problem when I run this query:

FOR result IN Collection SORT result.field ASC RETURN result

I added some index to speed up the query like skiplist index on the field sorted.

My Collection has inside more than 1M of records.

The application is hosted on n1-highmem-2 on Google Cloud. Below some specs:

  • 2 CPUs - Xeon E5 2.3Ghz
  • 13 GB of RAM
  • 10GB SSD

Unluckly, my query spend a lot of time to ending. What can I do?

Best regards, Carmelo

Community
  • 1
  • 1
carmelolg
  • 493
  • 5
  • 17
  • If there is a skiplist index present on the `field` attribute, can you check whether it is a sparse index? You can find out by running `db.Collection.getIndexes();` in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required. However, the query result will contain 1,000,000 documents still, and building this result set will definitely take some time and memory. Is it actually required to retrieve all 1M documents at once with a single query? – stj Oct 29 '15 at 16:09
  • I already tried with sparse index. – carmelolg Oct 29 '15 at 17:05
  • If the index is sparse, it can **not** be used for this query. If the index is non-sparse it can be used to iterate over the documents in sorted order, which will eliminate the `SORT` step, as the sorting is already provided by the index. However, the query will still build a huge result in memory which will take time and consume RAM. Is there a possibility to restrict the query result to just some documents, i.e. by using a `FILTER`? Otherwise you'll always export the whole 1M documents to the client, which may or may not be able to handle that amount. – stj Oct 29 '15 at 17:18
  • With no sparse index it works (index is matched). But if I want a large result set, because I have to do something in batch, what can I do? – carmelolg Oct 29 '15 at 17:28
  • If the result set is big, then it will definitely take time and memory to build it. To avoid this, you could split the query into chunks using `LIMIT`. For example, first iteration: `FOR result IN Collection SORT result.field LIMIT 10000 RETURN result`. Then process these first 10,000 documents offline, and note the `result` value of the last processed document. Now run the query again, but now with an additional FILTER: `FOR result IN Collection FILTER result.field > @lastValue LIMIT 10000 RETURN result` until there are no more documents. That should work fine if `result.field` is unique. – stj Oct 29 '15 at 17:50
  • If `result.field` is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation. Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already. – stj Oct 29 '15 at 17:51
  • Your explication was been really useful, thank you very much. – carmelolg Nov 02 '15 at 10:02
  • Is this still an issue? or can we mark this resolved? – dothebart Nov 23 '15 at 17:23
  • Yes, u can mark as resolved, thanks – carmelolg Nov 25 '15 at 07:58
  • I've put everything together in an answer below, please mark it as accepted ;-) – dothebart Dec 04 '15 at 16:27

1 Answers1

4

Summarizing the discussion above:

If there is a skiplist index present on the field attribute, it could be used for the sort. However, if its created sparse it can't. This can be revalidated by running

db.Collection.getIndexes();

in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required - which can be revalidated using Explain. However, the query will still build a huge result in memory which will take time and consume RAM.

If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.

For example, first iteration:

FOR result IN Collection SORT result.field LIMIT 10000 RETURN result

Then process these first 10,000 documents offline, and note the result value of the last processed document. Now run the query again, but now with an additional FILTER:

FOR result IN Collection
  FILTER result.field > @lastValue LIMIT 10000 RETURN result

until there are no more documents. That should work fine if result.field is unique.

If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation.

Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.

CodeManX
  • 11,159
  • 5
  • 49
  • 70
dothebart
  • 5,972
  • 16
  • 40