
I have ~7 million docs in a bucket, and I am struggling to write a query/index combination that keeps the query from running for more than 5 seconds.

Here is a similar scenario to the one I am trying to solve:

I have multiple coffee shops, each making coffee with different container/lid combos. The field keys also differ between doc types. With each sale that is generated, I keep track of these combos.

Here are a few example docs:

[{
    "shopId": "x001",
    "date":    "2022-01-01T08:49:00Z",
    "cappuccinoContainerId": "a001",
    "cappuccinoLidId": "b001"
},
{
    "shopId": "x001",
    "date":    "2022-01-02T08:49:00Z",
    "latteContainerId": "a002",
    "latteLidId": "b002"
},
{
    "shopId": "x001",
    "date":    "2022-01-02T08:49:00Z",
    "espressoContainerId": "a003",
    "espressoLidId": "b003"
},
{
    "shopId": "x002",
    "date":    "2022-01-01T08:49:00Z",
    "cappuccinoContainerId": "a001",
    "cappuccinoLidId": "b001"
},
{
    "shopId": "x002",
    "date":    "2022-01-02T08:49:00Z",
    "latteContainerId": "a002",
    "latteLidId": "b002"
},
{
    "shopId": "x002",
    "date":    "2022-01-02T08:49:00Z",
    "espressoContainerId": "a003",
    "espressoLidId": "b003"
}]

What I need to get out of the query is the following:

[{
    "shopId": "x001",
    "day":    "2022-01-01",
    "uniqueContainersLidsCombined": 2
},
{
    "shopId": "x001",
    "day":    "2022-01-02",
    "uniqueContainersLidsCombined": 4
},
{
    "shopId": "x002",
    "day":    "2022-01-01",
    "uniqueContainersLidsCombined": 2
},
{
    "shopId": "x002",
    "day":    "2022-01-02",
    "uniqueContainersLidsCombined": 4
}]

I.e. I want the total number of unique containers and lids combined per site and day.

I have tried using composite, adaptive and FTS indexes, but I am unable to figure this one out.

Does anybody have a different suggestion? Can someone please help?
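In plain query terms, what I'm after is something like the sketch below (bucket name `default` and the three drink types are assumptions; `IFMISSING` picks whichever container/lid field the doc actually has). This naive form does a full scan, which is exactly what is too slow at ~7 million docs:

```sql
-- Naive formulation of the requirement; assumes each doc carries
-- exactly one container/lid pair and the bucket is `default`.
SELECT d.shopId,
       SUBSTR(d.date, 0, 10) AS day,
       COUNT(DISTINCT IFMISSING(d.cappuccinoContainerId, d.latteContainerId, d.espressoContainerId))
     + COUNT(DISTINCT IFMISSING(d.cappuccinoLidId, d.latteLidId, d.espressoLidId))
         AS uniqueContainersLidsCombined
FROM default AS d
GROUP BY d.shopId, SUBSTR(d.date, 0, 10);
```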

Johan Ferreira

2 Answers

CREATE INDEX ix1 ON default(shopId, DATE_FORMAT_STR(date,"1111-11-11"), [cappuccinoContainerId, cappuccinoLidId]);

If using Couchbase Enterprise Edition and shopId is immutable, add PARTITION BY HASH (shopId) to the above index definition (with a higher number of partitions).

SELECT d.shopId,
       DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
       COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");

Adjust the index key order of shopId and day based on the query predicates. See https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/
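To check whether the grouping and aggregation are actually pushed down into the index scan (the point of the blog post above), run EXPLAIN on the query and look for an `index_group_aggs` section in the IndexScan part of the plan; if it is absent, the grouping happens in the query engine instead. A sketch, using the query from this answer:

```sql
-- If index aggregation kicks in, the plan's IndexScan section should
-- contain an "index_group_aggs" entry.
EXPLAIN SELECT d.shopId,
       DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
       COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");
```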

Update:

Based on the EXPLAIN output, you have a date predicate and query all shopIds, so use the following index instead:

CREATE INDEX ix2 ON default( DATE_FORMAT_STR(date,"1111-11-11"), shopId, [cappuccinoContainerId, cappuccinoLidId]);

Because you need the DISTINCT of cappuccinoContainerId and cappuccinoLidId, the pair is stored as a single index key (an array of 2 elements): [cappuccinoContainerId, cappuccinoLidId]. The advantage is that COUNT(DISTINCT ...) can reference this key directly, which allows index aggregation. (Do not put DISTINCT inside the index definition — that turns it into an array index, and things will not work as expected.)
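With ix2 the date expression is the leading key, so a query shaped to match it might look like this sketch (the date range values are placeholder assumptions; the predicate must use the same DATE_FORMAT_STR expression as the index to qualify):

```sql
-- Sketch only: the BETWEEN bounds are placeholders.
SELECT d.shopId,
       DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
       COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE DATE_FORMAT_STR(d.date,"1111-11-11") BETWEEN "2022-01-01" AND "2022-01-31"
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");
```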

vsr
  • Thank you so much @vsr! I have not seen any examples where an array is created inside the index. Could you please explain what is happening there? And would it help to use DISTINCT in the index? CREATE INDEX ix1 ON default(..., DISTINCT[x,y,x])? – Johan Ferreira Jul 06 '22 at 07:35
  • And is there a reason why you don't use an adaptive index? – Johan Ferreira Jul 06 '22 at 07:44
  • Also this doesn't seem to improve the performance. Here is the plan text of the query: https://gist.github.com/das-solo/ddb9831b42a293336f86033dc06c06c6 – Johan Ferreira Jul 06 '22 at 07:53
  • Based on query explain use the following index CREATE INDEX ix1 ON default(DATE_FORMAT_STR(date,"1111-11-11"), shopId,[cappuccinoContainerId, cappuccinoLidId]); – vsr Jul 09 '22 at 07:30

I assume

  1. That the cup types and lid types can be used for any drink type.
  2. That you don't want to add any precomputed stuff to your data.

Perhaps an index like this (my collection keyspace is `bulk`.`sales`.`amer`); note I am not sure whether this performs better or worse than (or is even equivalent to) the solution posted by vsr:

CREATE INDEX `adv_shopId_concat_nvls`
ON `bulk`.`sales`.`amer`(
  `shopId` INCLUDE MISSING,
  (NVL(`cappuccinoContainerId`, "") ||
   NVL(`cappuccinoLidId`, "") ||
   NVL(`latteContainerId`, "") ||
   NVL(`latteLidId`, "") ||
   NVL(`espressoContainerId`, "") ||
   NVL(`espressoLidId`, "")),
  SUBSTR0(`date`, 0, 10)
);

And then, using the covering index above, run your query like this:

SELECT 
   shopId,
   CONCAT( 
     NVL(cappuccinoContainerId,""), 
     NVL(cappuccinoLidId,""), 
     NVL(latteContainerId,""), 
     NVL(latteLidId,""), 
     NVL(espressoContainerId,""), 
     NVL(espressoLidId,"") 
   ) AS uniqueContainersLidsCombined,
   SUBSTR(date,0,10) AS day,
   COUNT(*) AS cnt
FROM `bulk`.`sales`.`amer`
GROUP BY 
  shopId,
  CONCAT( 
    NVL(cappuccinoContainerId,""), 
    NVL(cappuccinoLidId,""), 
    NVL(latteContainerId,""), 
    NVL(latteLidId,""), 
    NVL(espressoContainerId,""), 
    NVL(espressoLidId,"") 
  ),
  SUBSTR(date,0,10)

Note I used the following 16 lines of data:

{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T01:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T02:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T03:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T04:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T05:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T06:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}

Applying some sorting by wrapping the above query with

SELECT T1.* FROM
( 
-- paste above --
) AS T1
ORDER BY T1.day, T1.shopId, T1.uniqueContainersLidsCombined

We get

cnt day shopId  uniqueContainersLidsCombined
1   "2022-01-01"    "x001"  "a001b001"
1   "2022-01-01"    "x002"  "a001b001"
1   "2022-01-02"    "x001"  "a002b002"
1   "2022-01-02"    "x001"  "a003b003"
1   "2022-01-02"    "x002"  "a002b002"
1   "2022-01-02"    "x002"  "a003b003"
1   "2022-01-03"    "x001"  "a007b005"
2   "2022-01-03"    "x001"  "a007b004"
2   "2022-01-03"    "x002"  "a007b004"
5   "2022-01-03"    "x002"  "a007b005"

If you still don't get the performance you need, you could possibly use the Eventing service to do a continuous map/reduce, plus an occasional update query to make sure things stay perfectly in sync.

Jon Strabala