I am having a hard time converting the simple SQL query below into a Druid query:

SELECT country, city, COUNT(*)
FROM people_data
WHERE name = 'Mary'
GROUP BY country, city;

So far I have come up with this query:

{
  "queryType": "groupBy",
  "dataSource" : "people_data",
  "granularity": "all",
  "metric" : "num_of_pages",
  "dimensions": ["country", "city"],
  "filter" : {
      "type" : "and",
      "fields" : [
          {
            "type": "in",
            "dimension": "name",
            "values": ["Mary"]
          },
          {
            "type" : "javascript",
            "dimension" : "email",
            "function" : "function(value) { return (value.length !== 0) }"
          }
      ]
  },
  "aggregations": [
    { "type": "longSum", "name": "num_of_pages", "fieldName": "count" }
  ],
  "intervals": [ "2016-07-20/2016-07-21" ]
}

The query above runs, but the groupBy on the Druid datasource does not seem to be evaluated at all, since my output includes people with names other than Mary. Does anyone have any input on how to make this work?

1 Answer

The simple answer is that you cannot select arbitrary dimensions in your groupBy queries.

Strictly speaking, even the SQL query does not make sense. If a given combination of country and city has many different values of name and street, how do you squeeze them into a single row? You have to aggregate them, e.g. by using the max function.

In this case you can include the same column in your data as both a dimension and a metric, e.g. name_dim and name_metric, and include a corresponding aggregation over the metric, e.g. max(name_metric).

Please note that if these columns (name etc.) have high-cardinality values, that will kill Druid's roll-up feature.

Nikem
  • I have updated the query above to make it more useful. After a `group by` on country and city, I select the `country`, the `city`, and the count of rows in each group to see which country and city have the most people named `Mary`. Do you happen to know how I can translate this query into a Druid query (the `JSON` above)? –  Jul 25 '16 at 20:44
  • But your inner query seems to be exactly what you need: a groupBy with a filter and a `longSum` aggregation. Remove the outer query and try only the inner one. – Nikem Jul 26 '16 at 07:16
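
For reference, the "groupBy with a filter and a `longSum` aggregation" suggested in the last comment could look roughly like the sketch below. It is not a tested query. It assumes that `name` was ingested as a dimension and that `count` is a count metric created at ingestion time (so summing it gives the number of original rows even with roll-up enabled). The output name `num_of_people` is made up for illustration. Compared with the query in the question, it uses a `selector` filter for the equality test on `name`, drops the JavaScript filter on `email` (the SQL has no such condition), and drops the `metric` key, which as far as I know belongs to topN queries rather than groupBy.

```json
{
  "queryType": "groupBy",
  "dataSource": "people_data",
  "granularity": "all",
  "dimensions": ["country", "city"],
  "filter": {
    "type": "selector",
    "dimension": "name",
    "value": "Mary"
  },
  "aggregations": [
    { "type": "longSum", "name": "num_of_people", "fieldName": "count" }
  ],
  "intervals": ["2016-07-20/2016-07-21"]
}
```

Each result row should then carry one `country`/`city` pair together with `num_of_people`, which corresponds to `COUNT(*)` in the SQL. The interval is copied from the question and restricts the query to that one day.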