9

Here is my query with public dataset in BigQuery:

SELECT RANGE_BUCKET(reputation, [400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000]) AS reputation_group, COUNT(*) AS count
FROM `bigquery-public-data.stackoverflow.users`
Where reputation > 200000
GROUP BY 1
ORDER By 1

The result is below:

enter image description here

Instead of showing reputation_group as integer, how can I show the bucket's range:

0: [0-400000]
1: [400001-500000]
2: [500001-600000]
....

Thank you very much.

UPDATE: A big thank to Mikhail's answer with a minor change below:

SELECT bucket, 
  FORMAT('%i - %i', IFNULL(ranges[SAFE_OFFSET(bucket - 1)] + 1, 0), ranges[SAFE_OFFSET(bucket)]) AS reputation_group, 
  COUNT(*) AS COUNT
FROM `bigquery-public-data.stackoverflow.users`,
UNNEST([STRUCT([200000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000] AS ranges)]),
UNNEST([RANGE_BUCKET(reputation, ranges)]) bucket 
WHERE reputation > 200000
GROUP BY 1, 2
ORDER BY bucket 

Note an extra item 200000 is added to the STRUCT, which makes the result showing 200001 - 400000 instead of 0 - 400000

mdivk
  • 3,545
  • 8
  • 53
  • 91

2 Answers2

5

Below is for BigQuery Standard SQL

#standardSQL
SELECT bucket, 
  FORMAT('%i - %i', IFNULL(ranges[SAFE_OFFSET(bucket - 1)] + 1, 0), ranges[SAFE_OFFSET(bucket)]) AS reputation_group, 
  COUNT(*) AS COUNT
FROM `bigquery-public-data.stackoverflow.users`,
UNNEST([STRUCT([400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000] AS ranges)]),
UNNEST([RANGE_BUCKET(reputation, ranges)]) bucket 
WHERE reputation > 200000
GROUP BY 1, 2
ORDER BY bucket  

with result

Row bucket  reputation_group    COUNT    
1   0       0 - 400000          198  
2   1       400001 - 500000     23   
3   2       500001 - 600000     13   
4   3       600001 - 700000     12   
5   4       700001 - 800000     4    
6   5       800001 - 900000     5    
7   6       900001 - 1000000    2    
8   8       1100001 - 1200000   1    
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

With a JOIN and some refactoring:

WITH range_array AS (
  SELECT [400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000]
)
, buckets AS (
  SELECT LAG(bucket_end) OVER(ORDER BY reputation_group) bucket_start, *
  FROM UNNEST((SELECT * FROM range_array)) bucket_end WITH OFFSET reputation_group
)

SELECT *
  , (SELECT AS STRUCT * FROM buckets WHERE a.reputation_group = reputation_group) bucket
FROM (
  SELECT RANGE_BUCKET(reputation, (SELECT * FROM range_array)) AS reputation_group, COUNT(*) AS count
  FROM `bigquery-public-data.stackoverflow.users` 
  WHERE reputation > 200000
  GROUP BY 1
  ORDER BY 1
) a
ORDER BY reputation_group

But if you are doing all that, it's easier to forget about RANGE_BUCKET:

WITH buckets AS (
  SELECT IFNULL(LAG(max) OVER(ORDER BY grp), -10000000) min, *
  FROM UNNEST([400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000]) max WITH OFFSET grp
)


SELECT buckets.min, buckets.max, COUNTIF(reputation >= buckets.min AND reputation < buckets.max) c
FROM `bigquery-public-data.stackoverflow.users`, buckets
WHERE reputation > 200000
GROUP BY 1,2
ORDER BY 1

Or:

SELECT IFNULL(min,0) min, max, COUNT(*) c
FROM (
  SELECT (SELECT MAX(x) FROM UNNEST(ranges) x WHERE x<reputation) min, (SELECT MIN(x) FROM UNNEST(ranges) x WHERE x>reputation) max
  FROM `bigquery-public-data.stackoverflow.users` 
  , (SELECT [400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000] ranges) 
  WHERE reputation > 200000
)
GROUP BY 1, 2
ORDER BY 1
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Thank you Felipe, your last query is also nice and working, I wish Stackoverflow allows multiple answers but they don't do that at the moment. – mdivk Feb 17 '20 at 13:23
  • No worries. It's true you can only accept one answer. But you could upvote all the answers you like. – Felipe Hoffa Feb 17 '20 at 15:40
  • thanks for reminding. :) if you are good with DataStudio, it would be appreciated if you can take a look here: https://stackoverflow.com/questions/60243829/how-to-exclude-others-in-my-pie-chart-in-data-studio – mdivk Feb 17 '20 at 15:46