Google Bigquery inconsistent when variable names changes in ORDER BY clause

Question

My goal is to test if the grp's generated by one query, are the same grp's as the output of the same query. However, when I change a single variable name, I get different results.

Below I show an example of the same query where we know the results are the same. However, if you run this group, you will find one query produces different results than another.

SELECT grp
FROM
(
  SELECT CONCAT(word, corpus) AS grp, rank1, rank2 
  FROM (
    SELECT
      word, corpus,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY test1 DESC) AS rank1,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
    FROM 
    (
      SELECT *, (word_count * word_count * corpus_date) AS test1
      FROM [bigquery-public-data:samples.shakespeare]
    )
  )
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN 
(
  SELECT grp FROM (
    SELECT CONCAT(word, corpus) AS grp, rank1, rank2
    FROM
    (
      SELECT
        word, corpus,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
      FROM 
      (
        SELECT *, (word_count * word_count * corpus_date) AS test2
        FROM [bigquery-public-data:samples.shakespeare]
      )
    )
  )
  WHERE rank1 <= 3 OR rank2 <= 3
)

Far worse... now if you try running the exact same query, but simply change the variable name test1 to test3, you will get completely different results.

SELECT grp
FROM
(
  SELECT CONCAT(word, corpus) AS grp, rank1, rank2 
  FROM (
    SELECT
      word, corpus,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY test3 DESC) AS rank1,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
    FROM 
    (
      SELECT *, (word_count * word_count * corpus_date) AS test3
      FROM [bigquery-public-data:samples.shakespeare]
    )
  )
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN 
(
  SELECT grp FROM (
    SELECT CONCAT(word, corpus) AS grp, rank1, rank2
    FROM
    (
      SELECT
        word, corpus,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
      FROM 
      (
        SELECT *, (word_count * word_count * corpus_date) AS test2
        FROM [bigquery-public-data:samples.shakespeare]
      )
    )
  )
  WHERE rank1 <= 3 OR rank2 <= 3
)

I can think of no explanation that satisfies both of these bizarre behaviors and this is preventing me from being able to validate my data. Any ideas?

EDIT:

I've updated the BigQuery SQL in the way the responses would suggest, and the same inconsistencies occur.

Obviously it would be different. In first case you are ordering by column test1. In second case it is column test3! — I_am_Batman, Apr 20 '16 at 19:20
I don't understand your answer, care to explain? Those two columns are the same. I simply changed the name of the variable, which is the point of the question. — cgnorthcutt, Apr 20 '16 at 19:27

score 2 · Accepted Answer · answered Apr 21 '16 at 09:53

2

The problem is nondeterminism in your row numbering.

There are many examples in this table where (word_count * word_count * corpus_date) is the same for several corpuses. So when you partition by word and order by test2, the ordering you use for assigning row numbers is nondeterministic.

When you run the same subquery twice within the same top-level query, BigQuery actually executes that subquery twice and may yield different results between the two runs due to that nondeterminism.

Changing the alias might have just caused your query to not hit in the cache, resulting in a different set of nondeterministic choices and different amount of overlap between the results.

You can confirm this by changing the ORDER BY clause in your analytic functions to include corpus. For example, change ORDER BY test2 to ORDER BY test2, corpus. Then the row numbering will be deterministic, and the queries will return zero results regardless of what aliases you use.

answered Apr 21 '16 at 09:53

Jeremy Condit

6,766
1
28
30

This is great - thanks @Jeremy Condit. This is very clear. Many thanks to all for getting to the bottom of this. – cgnorthcutt Apr 22 '16 at 00:51
Went ahead and updated the question title to reflect your answer. – cgnorthcutt Apr 22 '16 at 00:52
unfortunately this is not full truth you were looking for - see http://stackoverflow.com/q/36781049/5221944 and http://stackoverflow.com/q/36781520/5221944 – Mikhail Berlyant Apr 22 '16 at 00:58
Based on that link, it does appear that aliases can affect ordering of equal values (in other words, cgn's results might have been stable even without the cache), but the same general answer still applies: BigQuery will pick an arbitrary ordering when sorting equal values, and differences in the ordering chosen explain cgn's results. – Jeremy Condit Apr 23 '16 at 06:29

Gordon Linoff · Answer 2 · 2016-04-21T02:53:38.833

1

I don't understand the question. SQL syntax in general and BigQuery in particular are both quite clear: aliases defined in the SELECT cannot be used in the SELECT for other expressions. As explained in the BigQuery documentation:

Aliases defined in a SELECT clause can be referenced in the GROUP BY, HAVING, and ORDER BY clauses of the query, but not by the FROM, WHERE, or OMIT RECORD IF clauses nor by other expressions in the same SELECT clause. [emphasis mine]

Hence, your query would only work if test1, test2, and test3 are columns in the Shakespeare table. There is no reason to think that such columns would have similar values, so I wouldn't expect the queries to return the same results.

EDIT:

If we assume that the documentation is incorrect, then the problem is probably duplicates in the order by criteria for row_number(). Sorting in SQL is not stable -- that means that two rows with the same sort key value can appear in any order during a sort. Even the same query can return different results on two runs. SQL sorts are obviously not stable, because tables have no inherent ordering among the rows (ordering is only specified by columns).

So, all that is happening is that different rows with the same sort-key values are being chosen. I don't think this is related to the aliases.

How can you fix this? Add an addition sort key, such as id, to the sort as the final key. Alternatively use rank() or dense_rank() and figure out explicitly what to do with the duplicates.

edited Apr 21 '16 at 02:53

answered Apr 20 '16 at 19:41

Gordon Linoff

1,242,037
58
646
786

Both the queries I provided run, without error, in Google Bigquery. But I appreciate your reference to the documentation. – cgnorthcutt Apr 20 '16 at 19:49
Unfortunately, your answer also fails. I edited to show a working example of what you suggested. If you try changing all variables named test1 to test3, it will produce different results. – cgnorthcutt Apr 20 '16 at 22:50
The aliases are a red herring, Gordon Linoff is right that it is most likely due to your use of row_number(). Ordering is the one way in which sql queries are non-deterministic and so can produce different results on different runs. – oulenz Apr 21 '16 at 07:47
1

@oulenz . . . I haven't used BigQuery very much and don't naturally re-use aliases this way. Ironically, earlier today I learned that there are some cases where BigQuery actually *requires* re-use of aliases in the `SELECT`. The documentation seems a bit off. – Gordon Linoff Apr 22 '16 at 03:21
@gordon-linoff I have seen some very weird errors with aliases in BigQuery. (E.g. claiming that a certain field can not be found.) Fortunately, the BigQuery team has promised this will be fixed in their upcoming overhaul of their sql dialect. – oulenz Apr 22 '16 at 08:20
@oulenz . . . They have promised to fix a lot of things. – Gordon Linoff Apr 24 '16 at 20:23

score 1 · Answer 3 · answered Apr 20 '16 at 20:50

I noticed you are always asking tough questions and then you are tough on accepting or even voting for answer. That’s Ok! And I want to try again so let’s go to subject:

Looks like using aliases in the same SELECT statement is undocumented and not supported Note below in SELECT clause documentation:

Each expression can be given an alias by adding a space followed by an identifier after the expression. The optional AS keyword can be added between the expression and the alias for improved readability. Aliases defined in a SELECT clause can be referenced in theGROUP BY, HAVING, and ORDER BY clauses of the query, but not by the FROM, WHERE, or OMIT RECORD IF clauses nor by other expressions in the same SELECT clause.

Thus, there is strange behavior here without throwing error. So you can use it on your own risk but better not (still would be great to hear from Google Team – but as it is not supported - you can expect no much info explaining this behavior)

Meantime - I would propose just follow what is supported and transform your query to below "stable" version.
It doesn't have problem that you face in your original one!
(note I’ve changed the WHERE clause in first subquery – otherwise it always returns zero rows – which makes total sense)

SELECT grp
FROM
(
  SELECT CONCAT(word, corpus) AS grp, rank2, 
    ROW_NUMBER() OVER (PARTITION BY word ORDER BY [try_any_alias_1] DESC) AS rank1
  FROM (
    SELECT
      word, corpus,
      (word_count * word_count * corpus_date) AS [try_any_alias_1],
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
    FROM [bigquery-public-data:samples.shakespeare]
  )
)
WHERE rank1 <= 3 OR rank2 <= 4 // if rank2 <= 3 as in second subquery - result is always empty as expected
HAVING grp NOT IN 
(
  SELECT grp FROM (
    SELECT CONCAT(word, corpus) AS grp, rank2,
      ROW_NUMBER() OVER (PARTITION BY word ORDER BY [try_any_alias_2] DESC) AS rank1
    FROM
    (
      SELECT
        word, corpus,
        (word_count * word_count * corpus_date) AS [try_any_alias_2],
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
        ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
      FROM [bigquery-public-data:samples.shakespeare]
    )
  )
  WHERE rank1 <= 3 OR rank2 <= 3
)

Hi Mikhail! Thanks for your response :) I've updated my code to do as both you and Gordon suggested. There are no aliases used in SELECT statements. It still produced different results when you change the variable name. It still says two SQL queries produce different results. I agree your example seems to work, but then if your answer is correct, you should be able to run the code in my question without error. Check it out! — cgnorthcutt, Apr 21 '16 at 00:10
1) looks like I pin-pointed the issue so I think I know how to make your code work with producing predictable and consistent result 2) if my guess on source of issue correct - sure I cannot explain why this issue exist as it is more for google team to answer; So now, it is up to you - are you really interested in practical solution that resolves your current issue? or you just looking for answers from google? I am asking because i don't want further waste my time if you anyway say that you want google's answers as you already did before with http://stackoverflow.com/a/35643085/5221944 — Mikhail Berlyant, Apr 21 '16 at 00:47
Hi Mikhail, How can you be sure that it is producing a predictable and consistent result unless we try every variable name? What I've done is shown that the principles by which you created your answer (to separate the alias from when it is used in SELECT) fails to work in general. So even though I wasn't able to guess an input that would cause an error with your answer, I have no grounds to assume it works in general, because I've shown an example where separating the alias from its use fails to be consistent. Does that make sense? — cgnorthcutt, Apr 21 '16 at 01:10
You are too generic–which is more appropriate in forums.SO is not forum. Here we have question and answer for specific practical issues to address. As I mentioned already in my previous comment - I FOUND cause of this issue and (despite of statement in google documentation of not using aliases within same select) it is not because of use of alias. I am glad it is not because I am actually using this trick in my practice a lot. At least while developing and experimenting. Anyway, it seems you are not interested in practical answers but rather in back and forth conversation. Sorry if I am wrong. — Mikhail Berlyant, Apr 21 '16 at 01:43
I've shown how the answer you provided is not reliable. I'm interested in finding the truth which correctly answers the question, nothing else. — cgnorthcutt, Apr 21 '16 at 01:49

Google Bigquery inconsistent when variable names changes in ORDER BY clause

3 Answers3