5

I have a 'vendors' table that looks like this...

**company itemKey itemPriceA itemPriceB**
companyA, 203913, 20, 10
companyA, 203914, 20, 20
companyA, 203915, 25, 5
companyA, 203916, 10, 10

It has potentially millions of rows per company and I want to query it to bring back a representative delta between itemPriceA and itemPriceB for each company. I don't care which delta I bring back as long as it isn't zero/null (like row 2 or 4), so I was using ANY_VALUE like this...

SELECT company
, ANY_VALUE(CASE WHEN (itemPriceA-itemPriceB)=0 THEN null ELSE (itemPriceA-itemPriceB) END)
FROM vendors
GROUP BY 1

It seems to be working but I notice 2 sentences that seem contradictory from Google's documentation...

"Returns NULL when expression is NULL for all rows in the group. ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected."

If ANY_VALUE returns null "when expression is NULL for all rows in the group" it should NEVER return null for companyA right (since only 2 of 4 rows are null)? But the second sentence sounds like it will indeed include the null rows.

P.s. you may be wondering why I don't simply add a WHERE clause saying "WHERE itemPriceA-itemPriceB>0" but in the event that a company has ONLY matching prices, I still want the company to be returned in my results.

d3wannabe
  • 1,207
  • 2
  • 19
  • 39
  • your query works. ANY_VALUE only retuns null if all values are null. In your case that means that there is no price difference in any item. Using avg instead of ANY_VALUE may give you more inside information. – Samuel Jan 30 '22 at 13:58
  • Thanks @Samuel - so why does the documentation say "rows for which expression is NULL are considered and may be selected". It sounds like the null rows may come back if they happen to be executed first? – d3wannabe Jan 30 '22 at 14:02
  • @Samuel you'll have to provide evidence of that when the docs say otherwise. From the [BigQuery Docs](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#any_value): ANY_VALUE behaves as if RESPECT NULLS is specified; **rows for which expression is NULL are considered and may be selected.** – 149203 Aug 03 '22 at 20:44
  • @149203 You challenged me to back up my comment. This took longer, but I have now written a response. Hopefully this will not be torn apart because this is quite experimental in its approach to the question and the question itself is already half a year old. – Samuel Aug 13 '22 at 14:46

3 Answers3

7

Clarification

I'm afraid the accepted answer will have to show stronger evidence that contradicts the docs.

@Raul Saucedo suggests that the following BigQuery documentation is referring to WHERE clauses:

rows for which expression is NULL are considered and may be selected

This is not the case. WHERE clauses are not mentioned anywhere in the ANY_VALUE docs. (Nowhere on the page. Try to ctrl+f for it.) And the docs are clear, as I'll explain.

@d3wannabe is correct to wonder about this:

It seems to be working but I notice 2 sentences that seem contradictory from Google's documentation...

"Returns NULL when expression is NULL for all rows in the group. ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected."

But the docs are not contradictory. The 2 sentences coexist.

  1. "Returns NULL when expression is NULL for all rows in the group." So if all rows in a column are NULL, it will return NULL.
  2. "ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected." So if the column has rows mixed with NULLs and actual data, it will select anything from that column, including nulls.

How to create an ANY_VALUE without nulls in BigQuery

We can use ARRAY_AGG to turn a group of values into a list. This aggregate function has the option to INGORE NULLS. We then select 1 item from the list after ignoring nulls.

If we have a table with 2 columns: id and mixed_data, where mixed_data has some rows with nulls:

SELECT
    id,
    ARRAY_AGG( -- turn the mixed_data values into a list
        mixed_data -- we'll create an array of values from our mixed_data column
        IGNORE NULLS -- there we go!
        LIMIT 1 -- only fill the array with 1 thing
    )[SAFE_OFFSET(0)] -- grab the first item in the array
    AS any_mixed_data_without_nulls
FROM your_table
GROUP BY id

See similar answers here:

Update, 2022-08-12

There is evidence that the docs may be inconsistent with the actual behavior of the function. See Samuel's latest answer to explore his methodology.

However, we cannot know if the docs are incorrect and ANY_VALUE behaves as expected or if ANY_VALUE has a bug and the docs express the intended behavior. We don't know if Google will correct the docs or the function when they address this issue.

Therefore I would continue to use ARRAY_AGG to create a safe ANY_VALUE that ignores nulls until we see a fix from Google.

Please upvote the issue in Google's Issue Tracker to see this resolved.

149203
  • 186
  • 3
  • 10
3

This is an explanation about how “any_value works with null values”.

With any_value always return the first value, if there is a value different from null.

SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST([null, "banana",null,null]) as fruit;

Return null if all rows have null values. Refers at this sentence

“Returns NULL when expression is NULL for all rows in the group”

SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST([null, null, null]) as fruit
 

Return null if one value is null and you specified in the where clause. Refers to these sentences

“ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected.”

SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST(["apple", "banana", null]) as fruit
where fruit is null

Always depends which filter you are using and the field inside the any_value.

You can see this example, return two rows that are different from 0.

SELECT ANY_VALUE(e).company, (itemPriceA-itemPriceB) as value
FROM `vendor` e
where (itemPriceA-itemPriceB)!=0
group by e.company
Raul Saucedo
  • 1,614
  • 1
  • 4
  • 13
  • "With any_value always return the first value, if there is a value different from null." This is not true. From the [BigQuery Docs](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#any_value): ANY_VALUE behaves as if RESPECT NULLS is specified; **rows for which expression is NULL are considered and may be selected.** – 149203 Aug 03 '22 at 20:18
3

The documentation says that "NULL are considered and may be" returned by an any_value statement. However, I am quite sure the documentation is wrong here. In the current implementation, which was tested on 13th August 2022, the any_value will return the first value of that column. However, if the table does not have an order by specified, the sorting may be random due to processing of the data on several nodes.

For testing a large table of nulls is needed. To generate_array will come handy for that. This array will have several entries and the value zero for null. The first 1 million entries with value zero are generated in the table tmp. Then table tbl adds before and after the [-100,0,-90,-80,3,4,5,6,7,8,9] the 1 million zeros. Finally, calculating NULLIF(x,0) AS x replaces all zeros by null.

Several test of any_value using the test table tbl are done. If the table is not further sorted, the first value of that column is returned: -100.

WITH
  tmp AS (SELECT ARRAY_AGG(0) AS tmp0 FROM UNNEST(GENERATE_ARRAY(1,1000*1000))),
  tbl AS (
  SELECT
    NULLIF(x,0) AS x,
    IF(x!=0,x,NULL) AS y, 
    rand() AS rand
  FROM
    tmp,
    UNNEST(ARRAY_CONCAT(tmp0, [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9] , tmp0)) AS x )

SELECT "count rows", COUNT(1) FROM tbl
UNION ALL SELECT "count items not null", COUNT(x) FROM tbl 
UNION ALL SELECT "any_value(x): (returns first non null element in list: -100)", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "2nd run", ANY_VALUE(x) FROM tbl 
UNION ALL SELECT "3rd run", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "any_value(y)",  ANY_VALUE(y) FROM tbl
UNION ALL SELECT "order asc", ANY_VALUE(x) FROM (Select * from tbl order by x asc)
UNION ALL SELECT "order desc (returns largest element: 9)", ANY_VALUE(x) FROM (Select * from tbl order by x desc)
UNION ALL SELECT "order desc", ANY_VALUE(x) FROM (Select * from tbl order by x desc)
UNION ALL SELECT "order abs(x) desc", ANY_VALUE(x) FROM (Select * from tbl order by abs(x) desc )
UNION ALL SELECT "order abs(x) asc (smallest number: 3)", ANY_VALUE(x) FROM (Select * from tbl order by abs(x) asc )
UNION ALL SELECT "order rand asc", ANY_VALUE(x) FROM (Select * from tbl order by rand asc )
UNION ALL SELECT "order rand desc", ANY_VALUE(x) FROM (Select * from tbl order by rand desc )

This gives following result:

enter image description here

  • The first not null entry, -100 is returned.
  • Sorting the table by this column causes the any_value to always return the first entry
  • In the last two examples, the table is ordered by random values, thus any_value returns random entries

If the dataset is larger than 2 million rows, the table may be internally split to be processed; this will result in a not ordered table. Without the order by command the first entry on the table and thus the result of any_value cannot be predicted. For testing this, please replace the 10th line by

 UNNEST(ARRAY_CONCAT(tmp0,tmp0,tmp0,tmp0,tmp0,tmp0,tmp0,tmp0, [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9] , tmp0,tmp0)) AS x )
Samuel
  • 2,923
  • 1
  • 4
  • 19
  • If you create an issue in Google's issue tracker, I'll upvote it. https://issuetracker.google.com/issues/new?component=187149&template=1162659&pli=1 – 149203 Aug 14 '22 at 16:58
  • There is the Google issue tracker: https://issuetracker.google.com/issues/242438603 – Samuel Aug 15 '22 at 08:27
  • Nice! Well done! I've upvoted and I'll update my answer. – 149203 Aug 15 '22 at 17:29