2

i'm querying a system that won't allow using DISTINCT, so my alternative is to do a GROUP BY to get near to a result

my desired query was meant to look like this,

SELECT 
SUM(column1) AS column1,
SUM(column2) AS column2,
COUNT(DISTINCT(column3)) AS column3
FROM table

for the alternative, i would think i'd need some type of nested query along the lines of this,

SELECT 
SUM(column1) AS column1,
SUM(column2) AS column2,
COUNT(SELECT column FROM table GROUP BY column) AS column3
FROM table

but it didn't work. Am i close?

John Bollinger
  • 160,171
  • 8
  • 81
  • 157
user3768071
  • 727
  • 5
  • 12
  • 17

4 Answers4

3

You are using the wrong syntax for COUNT(DISTINCT). The DISTINCT part is a keyword, not a function. Based on the docs, this ought to work:

SELECT 
  SUM(column1) AS column1,
  SUM(column2) AS column2,
  COUNT(DISTINCT column3) AS column3
FROM table

Do, however, read the docs. BigQuery's implementation of COUNT(DISTINCT) is a bit unusual, apparently so as to scale better for big data. If you are trying to count a large number of distinct values then you may need to specify a second parameter (and you have an inherent scaling problem).

Update:

If you have a large number of distinct column3 values to count, and you want an exact count, then perhaps you can perform a join instead of putting a subquery in the select list (which BigQuery seems not to permit):

SELECT *
FROM (
    SELECT
      SUM(column1) AS column1,
      SUM(column2) AS column2
    FROM table
  )
  CROSS JOIN (
      SELECT count(*) AS column3
      FROM (
          SELECT column3
          FROM table
          GROUP BY column3
        )
    )

Update 2:

Not that joining two one-row tables would be at all expensive, but @FelipeHoffa got me thinking more about this, and I realized I had missed a simpler solution:

SELECT
  SUM(column1) AS column1,
  SUM(column2) AS column2,
  COUNT(*) AS column3
FROM (
    SELECT
      SUM(column1) AS column1,
      SUM(column2) AS column2
    FROM table
    GROUP BY column3
  )

This one computes a subtotal of column1 and column2 values, grouping by column3, then counts and totals all the subtotal rows. It feels right.

John Bollinger
  • 160,171
  • 8
  • 81
  • 157
  • thanks for the contribution so far guys... and right you are John Bollinger that... COUNT(DISTINCT column3) AS column3 ...is correct. at i first did something similar giving me same count but incorrect total. results were off by less than 2500 or so records. however, when i do GROUP BY the results are bang on. this is why i wanted to go with GROUP BY and then do a count all. i don't know why google big query is off like this. maybe it's because it's a big data architecture and it scoots around trawling through each record? – user3768071 Apr 09 '15 at 20:46
  • Did you read the documentation as I urged you to do? It explains exactly why your count might be off if the number of distinct values is very large. – John Bollinger Apr 09 '15 at 20:51
  • apologies John yes, you did explain that... sorry a bit late in the evening. ok, i'll read up on the documentation. such a pain because 'uniques' is a crucial metric. could you give more detail where you mention... you may need to specify a second parameter. – user3768071 Apr 09 '15 at 21:02
  • btw, i should probably share that when i first realized DISTINCT was all wonky, this threw an error saying it's not supported by Big Query: DISTINCT(column3) AS column3 ...but this gave me the same result as your fix: COUNT(DISTINCT(column3)) AS column3 – user3768071 Apr 09 '15 at 21:14
  • The BigQuery version of `COUNT(DISTINCT)` accepts a second argument that represents an upper bound on the result values that will be computed exactly. The default is 1000. If the count is greater, then the result is estimated. – John Bollinger Apr 09 '15 at 21:19
  • I have updated my answer with a possible alternative solution. It works around the restriction on subqueries in the select list by selecting from a (cross) join of two one-row subqueries, one of which itself selects from a subquery. BigQuery *does* accept subqueries in the `FROM` clause (a.k.a. inline views). – John Bollinger Apr 09 '15 at 21:20
  • John is right, to get a COUNT(DISTINCT) for big numbers of distinct values add a comma to the operation to ask for bigger values (as stated in the docs). Now, an alternative without involving JOINS is a sub-query. The original question almost had it, but I added an answer with that option. – Felipe Hoffa Apr 10 '15 at 15:22
1

FWIW, the way you are trying to use DISTINCT isn't how its normally used, as its meant to show unique rows, not unique values for one column in a dataset. GROUP BY is more in line with what I believe you are ultimately trying to accomplish.

Depending upon what you need you could do one of a couple things. Using your second query, you would need to modify your subquery to get a count, not the actual values, like:

SELECT 
SUM(column1) AS column1,
SUM(column2) AS column2,
(SELECT sum(1) FROM table GROUP BY column) AS column3
FROM table

Alternatively, you could do a query off your initial query, something like this:

SELECT sum(column1), sum(column2), sum(column4) from (
SELECT 
SUM(column1) AS column1,
SUM(column2) AS column2,
1 AS column4
FROM table GROUP BY column3)
GROUP BY column4

Edit: The above is generic SQL, not too familiar with Google Big Query

Duffmaster33
  • 1,160
  • 9
  • 16
  • Evidently BigQuery does not accommodate subqueries in the select list. Selected expressions must be "field names, literals, or functional expressions that operate on fields or literals." – John Bollinger Apr 09 '15 at 20:54
0

You can probably use a CTE

WITH result as (select column from table group by column)

SELECT 
SUM(column1) AS column1,
SUM(column2) AS column2,
Select Count(*) From result AS column3
FROM table
JustAPup
  • 1,720
  • 12
  • 19
0

Instead of doing a COUNT(DISTINCT), you can get the same results by running a GROUP BY first, and then counting results.

For example, the number of different words that Shakespeare used by year:

SELECT corpus_date, COUNT(word) different_words
FROM (
  SELECT word, corpus_date
  FROM [publicdata:samples.shakespeare]
  GROUP BY word, corpus_date
)
GROUP BY corpus_date
ORDER BY corpus_date

As a bonus, let's add a column that identifies which books were written during each year:

SELECT corpus_date, COUNT(word) different_words, GROUP_CONCAT(UNIQUE(corpus)) books
FROM (
  SELECT word, corpus_date, UNIQUE(corpus) corpus
  FROM [publicdata:samples.shakespeare]
  GROUP BY word, corpus_date
)
GROUP BY corpus_date
ORDER BY corpus_date
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325