druid sql query - count distinctly for a multi value field across records

Question

Is there a way to do a distinct count across different rows for a multi-value field in druid SQL for a particular value in which value is only counted once across an array? eg suppose I have below records :

shippingSpeed 
[standard, standard, standard, ground]
[standard,ground]
[ground,ground]

Expected Result:

standard 2
ground 3

I tried below query but it is aggregating the field count inside an array and then giving the total count across all records:

SELECT
"shippingSpeed", count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC

Result:

standard 4
ground 4

score 1 · Answer 1 · answered Jan 26 '23 at 01:55

This is because the Group By on multi-valued columns will UNNEST the array into multiple rows. It is counting each item as an instance correctly.

If you want to remove duplicates, define "shippingSpeed" at ingestion time with the property: "multiValueHandling": "SORTED_SET"

You can find more details here: https://druid.apache.org/docs/latest/querying/multi-value-dimensions.html#overview

score 1 · Answer 2 · answered Feb 03 '23 at 14:22

Okay there are some undocumented function's that you can use.

SELECT
array_set_add(MV_TO_ARRAY("shippingSpeed",null) , count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC

which might work.

MV_TO_ARRAY -> converts the multi value col to an array

array_set_add -> creates a set out of the arrays. Since we donot have 2 arrays, second argument is null.

but what @sergio said might be the easiest option.

druid sql query - count distinctly for a multi value field across records

2 Answers2