1

I think ARRAY_AGG function in BigQuery seems to have a bug in behaviour of ORDER BY. Here is some SQL to explain that:

#standardSQL
WITH t1 AS (
  SELECT *
  FROM UNNEST ( [
    STRUCT(1 AS user_id, 1 AS team_id, "2018-07-17" AS date_str),
    (  2, 1, "2018-07-17" ),
    (  3, 1, "2018-07-17" ),
    (  4, 1, "2018-07-17" ),
    (  5, 1, "2018-07-17" ),
    (  6, 1, "2018-07-17" ),
    (  7, 1, "2018-07-17" ),
    (  8, 2, "2018-07-17" ),
    (  9, 2, "2018-07-17" ),
    ( 10, 2, "2018-07-17" ),
    ( 11, 2, "2018-07-17" ),
    ( 14, 3, "2018-07-17" ),
    ( 15, 3, "2018-07-17" ),
    ( 16, 3, "2018-07-17" ),
    ( 17, 3, "2018-07-17" ),
    (  1, 1, "2018-07-18" ),
    (  4, 1, "2018-07-18" ),
    (  5, 1, "2018-07-18" ),
    (  6, 1, "2018-07-18" ),
    (  7, 1, "2018-07-18" ),
    (  8, 2, "2018-07-18" ),
    (  9, 2, "2018-07-18" ),
    ( 10, 2, "2018-07-18" ),
    ( 11, 2, "2018-07-18" ),
    ( 12, 2, "2018-07-18" ),
    ( 13, 2, "2018-07-18" ),
    ( 14, 3, "2018-07-18" ),
    ( 15, 3, "2018-07-18" ),
    ( 16, 3, "2018-07-18" ),
    ( 17, 3, "2018-07-18" ),
    ( 18, 3, "2018-07-18" ) ] ) )

SELECT
  date_str,
  ARRAY_AGG(teams ORDER BY users) AS a1,
  ARRAY_AGG(users ORDER BY users) AS a2,
  ARRAY_AGG(teams ORDER BY teams) AS a3,
  ARRAY_AGG(users ORDER BY teams) AS a4,
  ARRAY_AGG(STRUCT(teams, users) ORDER BY users) AS a5
FROM (
  SELECT
    date_str,
    users,
    COUNT(*) AS teams
  FROM (
    SELECT
      date_str,
      team_id,
      COUNT(*) AS users
    FROM t1
    GROUP BY date_str, team_id
  )
  GROUP BY date_str, users
)
GROUP BY date_str
ORDER BY date_str;

This query returns;

+-----+------------+----+----+----+----+----------+----------+
| Row | date_str   | a1 | a2 | a3 | a4 | a5.teams | a5.users |
+-----+------------+----+----+----+----+----------+----------+
|   1 | 2018-07-17 |  1 |  4 |  1 |  4 |        2 |        4 |
|     |            |  2 |  7 |  2 |  7 |        1 |        7 |
|   2 | 2018-07-18 |  1 |  5 |  1 |  5 |        2 |        5 |
|     |            |  2 |  6 |  2 |  6 |        1 |        6 |
+-----+------------+----+----+----+----+----------+----------+

But I expect as;

+-----+------------+----+----+----+----+----------+----------+
| Row | date_str   | a1 | a2 | a3 | a4 | a5.teams | a5.users |
+-----+------------+----+----+----+----+----------+----------+
|   1 | 2018-07-17 |  2 |  4 |  1 |  7 |        2 |        4 |
|     |            |  1 |  7 |  2 |  4 |        1 |        7 |
|   2 | 2018-07-18 |  2 |  5 |  1 |  6 |        2 |        5 |
|     |            |  1 |  6 |  2 |  5 |        1 |        6 |
+-----+------------+----+----+----+----+----------+----------+

It seems the ORDER BY clause in ARRAY_AGG function doesn't work properly since a1 and a4 are ordered wrongly.

Besides, it is hard to swallow that the query work exactly as expected when I replace either of two COUNT(*) parts with COUNT(user_id) or COUNT(team_id), which means;

SELECT
  date_str,
  ARRAY_AGG(teams ORDER BY users) AS a1,
  ARRAY_AGG(users ORDER BY users) AS a2,
  ARRAY_AGG(teams ORDER BY teams) AS a3,
  ARRAY_AGG(users ORDER BY teams) AS a4,
  ARRAY_AGG(STRUCT(teams, users) ORDER BY users) AS a5
FROM (
  SELECT
    date_str,
    users,
    COUNT(*) AS teams
  FROM (
    SELECT
      date_str,
      team_id,
      COUNT(user_id) AS users
    FROM t1
    GROUP BY date_str, team_id
  )
  GROUP BY date_str, users
)
GROUP BY date_str
ORDER BY date_str;

or

SELECT
  date_str,
  ARRAY_AGG(teams ORDER BY users) AS a1,
  ARRAY_AGG(users ORDER BY users) AS a2,
  ARRAY_AGG(teams ORDER BY teams) AS a3,
  ARRAY_AGG(users ORDER BY teams) AS a4,
  ARRAY_AGG(STRUCT(teams, users) ORDER BY users) AS a5
FROM (
  SELECT
    date_str,
    users,
    COUNT(team_id) AS teams
  FROM (
    SELECT
      date_str,
      team_id,
      COUNT(*) AS users
    FROM t1
    GROUP BY date_str, team_id
  )
  GROUP BY date_str, users
)
GROUP BY date_str
ORDER BY date_str;

To my understanding, these queries must return identical results to the original one in this condition. It is quite confusing for me. Might be a bug or something I misunderstand?


Some additional information.

The inner subquery;

SELECT
  date_str,
  users,
  COUNT(*) AS teams
FROM (
  SELECT
    date_str,
    team_id,
    COUNT(*) AS users
  FROM t1
  GROUP BY date_str, team_id
)
GROUP BY date_str, users

This returns;

+-----+------------+-------+-------+
| Row | date_str   | users | teams |
+-----+------------+-------+-------+
|   1 | 2018-07-18 |     5 |     2 |
|   2 | 2018-07-17 |     7 |     1 |
|   3 | 2018-07-18 |     6 |     1 |
|   4 | 2018-07-17 |     4 |     2 |
+-----+------------+-------+-------+

So creating this data directly by with clause and run the same aggregate query;

#standardSQL
With t2 AS (
  SELECT *
  FROM UNNEST ( [
    STRUCT("2018-07-18" AS date_str, 5 AS users, 2 AS teams),
    (  "2018-07-17", 7, 1 ),
    (  "2018-07-18", 6, 1 ),
    (  "2018-07-17", 4, 2 ) ] )
)

SELECT
  date_str,
  ARRAY_AGG(teams ORDER BY users) AS a1,
  ARRAY_AGG(users ORDER BY users) AS a2,
  ARRAY_AGG(teams ORDER BY teams) AS a3,
  ARRAY_AGG(users ORDER BY teams) AS a4,
  ARRAY_AGG(STRUCT(teams, users) ORDER BY users) AS a5
FROM t2
GROUP BY date_str
ORDER BY date_str;

The result became what I am looking for;

+-----+------------+----+----+----+----+----------+----------+
| Row | date_str   | a1 | a2 | a3 | a4 | a5.teams | a5.users |
+-----+------------+----+----+----+----+----------+----------+
|   1 | 2018-07-17 |  2 |  4 |  1 |  7 |        2 |        4 |
|     |            |  1 |  7 |  2 |  4 |        1 |        7 |
|   2 | 2018-07-18 |  2 |  5 |  1 |  6 |        2 |        5 |
|     |            |  1 |  6 |  2 |  5 |        1 |        6 |
+-----+------------+----+----+----+----+----------+----------+

I don't understand what cause this happen. I am completely puzzled. Any ideas or suggestions are appreciated.

mule
  • 41
  • 4

1 Answers1

1

Sorry if I'm misunderstanding but the default order by is ascending so it is sorting properly?

ARRAY_AGG(teams ORDER BY users desc) AS a1,
ARRAY_AGG(users ORDER BY users) AS a2,
ARRAY_AGG(teams ORDER BY teams) AS a3,
ARRAY_AGG(users ORDER BY teams desc) AS a4, 

If I change those to sort descending I get the desire result

Bobbylank
  • 1,906
  • 7
  • 15
  • Thanks for your reply.Yes, the example is a bit tricky and confusing itself. Let's take `a1` field for instance. It is an array of `teams` order by `users`. `teams` and `users` of the same `date_str` ordered the other way around in this example, which is clarified by `a5` field. So `teams` in `a1` are to be ordered *descendingly* when you order them by `users` *ascendingly*. Hope this explanation make sense for you. – mule Sep 14 '18 at 09:23