how do we write sum(distinct col) in druid ? if i try to write in druid, it says plans can't be build, but same is possible in Druid. I tried to convert to subquery approach, but my inner query returns lot of item level data, hence timing out.
1 Answers
The distinct count or sum is not something which is by default supported by druid.
There are actually several methods which give you a similar result.
Option 1. Theta Sketch extension (recommended)
If you enable the Theta Sketch extension (See https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html) you can use this to get the same result.
Example:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "cardinality",
"name": "col",
"fields": [
{
"type": "default",
"dimension": "domain",
"outputType": "string",
"outputName": "domain"
}
],
"byRow": false,
"round": false
}
]
}
Result:
+--------+
| domain |
+--------+
| 22 |
+--------+
Option 2: cardinality
The cardinality() aggregation computes the cardinality of a set of Apache Druid (incubating) dimensions, using HyperLogLog to estimate the cardinality.
Example:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "cardinality",
"name": "domain",
"fields": [
{
"type": "default",
"dimension": "domain",
"outputType": "string",
"outputName": "domain"
}
],
"byRow": false,
"round": false
}
]
}
Response:
+-----------------+
| domain |
+-----------------+
| 22.119017166376 |
+-----------------+
Option 3. use hyperUnique
This option requires that you keep track of the counts at indexation time. If you have applied this, you can use this in your query:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "hyperUnique",
"name": "domain",
"fieldName": "domain",
"isInputHyperUnique": false,
"round": false
}
],
"context": {
"groupByStrategy": "v2"
}
}
As I have no hyperUnique metric in my data set, I have no exact example response.
This page explains this method very well: https://blog.mshimul.com/getting-unique-counts-from-druid-using-hyperloglog/
Conclusion
In my opinion the Theta Sketch extension is the best and most easy way to get the result. Please read the documentation carefully.
If you are an PHP user you could take a look at this, maybe it helps:

- 579
- 2
- 15