1

I have a fairly large collection (~10MM elements), and I need to run queries that return fairly large result sets (~500K elements). I need those elements grouped by or ordered based on two attributes:

  1. The first attribute (let's call it A) has about 2MM distinct instances.
  2. The second attribute (let's call it B) has about 10 distinct instances.

All elements with matching (A, B) pairs need to be grouped together for processing purposes. I could, potentially, add them to a map and then process the map; however, given that I need to run quite a few of those queries in parallel, I would rather avoid creating the intermediate map and instead extract/process the groups directly from the result set.

One way to do this is to sort by (A, B); this works, as I can extract the chunks that belong together from the result set and process them as I go. However, the query execution times are way too slow. I tried adding different types of indexes to A and B, but nothing seems to help with performance. For the record, I don't need them sorted; I only need the matching (A, B) elements to show together in the result set.

Is there a way to achieve this? Any help is appreciated.

Al A
  • 175
  • 2
  • 15

1 Answers1

2

There might be a few ways to optimize this.

In CQEngine I tend to use the term ordering to describe the desired outcome, and then there can be multiple ordering strategies to achieve the ordering.

The default behavior in CQEngine is to use what's known as the materialize ordering strategy, which involves copying all of the results matching a query into a temporary collection which is then sorted explicitly. This does not scale well to large ResultSets, so I'm not surprised that it would be slow for a ResultSet of 500K elements in this case. See OrderingStrategies for details.

However the TL;DR is: you could request CQEngine to use the index ordering strategy.

Index-accelerated ordering on attribute A

In your example you need to order results by (A, B), so you would request CQEngine to use an index on attribute A to accelerate the ordering. As you need to order by A and then B, this won't completely eliminate the need to sort results, because if the engine finds multiple objects which match the query colocated in a single bucket in index A, it will still have to explicitly sort a small number of matching objects in that bucket by attribute B in order to achieve the overall ordering by (A, B).

All other things being equal, based on a collection size of 10MM elements, and 2MM distinct values for A, you can expect each bucket in the index on A to contain 5 objects. So I'd expect that this approach could achieve a major reduction in time-to-first-result latency. It might not be perfect though...

Partial indexes

The approach above generally works well for data which resembles time-series, where A might be a timestamp, and you want to search recent items and your want to order results by recency too.

However a downside of using an index on attribute A to drive the search for workloads which don't resemble time-series, is that the ordered index on A which is being traversed to achieve the ordering of results, might be polluted with lots of objects which don't match the query you are trying to accelerate. So this can introduce filtering overhead.

You didn't mention much about your queries. For example do you know the set of queries in advance, or are they completely arbitrary?

I ask about this, because if you know which queries you expect in advance (or at least some fragments of the queries you expect in advance), then you could set up PartialIndexes on attribute A which are configured with filter queries you expect to find in your workload.

Partial indexes (plus enabling the index ordering strategy) can be a great way to mitigate index pollution, which means an index can be used to accelerate the ordering of large ResultSets, and it won't be susceptible to excessive filtering. Hope that helps!

npgall
  • 2,979
  • 1
  • 24
  • 24
  • Thanks for the response - I tried both suggestions, but CQ seems to simply ignore the indexes and go for the post-sorting. I tried to play with the INDEX_ORDERING_SELECTIVITY, but when I specify that option, then the sorting stops working. – Al A Sep 11 '17 at 21:05
  • Going into more details, `A` and `B` are both strings, and they are not part of the where clause. I don't know the set of queries in advance, this is for a shared cache service, and there is no telling how people will query it. I have also tried to create a composite attribute on `A` and `B`, simply concatenating them, and then creating a `NavigableIndex` on it, but this also seems to have no effect - it still does the post-sorting. – Al A Sep 11 '17 at 21:10
  • I do have control, however, of the "use case" query I mentioned above. – Al A Sep 11 '17 at 21:12
  • There are a number of requirements that need to be met, in order for the engine to leverage an index for ordering. There is however a way to have it log its decisions which might tell us what's going on: https://github.com/npgall/cqengine/blob/2.12.1/code/src/main/java/com/googlecode/cqengine/engine/CollectionQueryEngine.java#L356 – npgall Sep 12 '17 at 22:18
  • I guess stack overflow is not the best medium to debug this particular issue in detail. There is a discussion forum for the project. If you'd like to post details of the problem there, I can help you to enable debug logging to see what's going on. http://groups.google.com/forum/#!forum/cqengine-discuss – npgall Sep 12 '17 at 22:23