-1

A quick question to GBQ gurus.

Here are two queries that are identical in their purpose

first

SELECT
  fullVisitorId AS userid,
  CONCAT(fullVisitorId, visitStartTime) AS session,
  visitStartTime + (hits[
  OFFSET(0)].time / 1000) AS eventtime,
  date,
  trafficSource.campaign,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.adContent,
  trafficSource.adwordsClickInfo.campaignId,
  geoNetwork.region,
  geoNetwork.city,
  trafficSource.keyword,
  totals.visits AS visits,
  device.deviceCategory AS deviceType,
  hits[OFFSET(0)].eventInfo.eventAction,
  hits[OFFSET(0)].TRANSACTION.transactionId,
  hits[OFFSET(0)].TRANSACTION.transactionRevenue,
  SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
  `some_site.ga_sessions_*`
WHERE
  ARRAY_LENGTH(hits) > 0
  AND _table_suffix BETWEEN '20200201'
  AND '20200201'
  AND fullVisitorId IN (
  SELECT
    DISTINCT(fullVisitorId)
  FROM
    `some_site.ga_sessions_*`,
    UNNEST(hits) AS hits
  WHERE
    _table_suffix BETWEEN '20200201'
    AND '20200201'
    AND (hits.TRANSACTION.transactionId != 'None')
)

second

SELECT
  fullVisitorId AS userid,
  CONCAT(fullVisitorId, visitStartTime) AS session,
  visitStartTime + (hits.time / 1000) AS eventtime,
  date,
  trafficSource.campaign,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.adContent,
  trafficSource.adwordsClickInfo.campaignId,
  geoNetwork.region,
  geoNetwork.city,
  trafficSource.keyword,
  totals.visits AS visits,
  device.deviceCategory AS deviceType,
  hits.eventInfo.eventAction,
  hits.TRANSACTION.transactionId,
  hits.TRANSACTION.transactionRevenue,
  SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
  `some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
  _table_suffix BETWEEN '20200201' AND '20200201'
  AND fullVisitorId IN (
    SELECT
      DISTINCT(fullVisitorId)
    FROM
      `some_site.ga_sessions_*`,
      UNNEST(hits) AS hits
    WHERE
      _table_suffix BETWEEN '20200201'
      AND '20200201'
      AND (hits.TRANSACTION.transactionId != 'None')
  )

The 1st one uses OFFSET to extract data from nested fields. According to execution details report, the query requires about 1.5 MB of shuffling.

enter image description here

The 2nd query uses UNNEST to reach nested data. And the amount of shuffled bytes is around (!) 75 MB

enter image description here

The amount of processed data is the same in both cases.

Now, the question is:

Does that mean that according to this article which concerns optimizing communication between slots I should uses OFFSET instead of UNNEST to get the data stored in nested fields?

Thanks!

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
Dmitri Ilin
  • 87
  • 4
  • 13
  • At a glance, they're not identical. The OFFSET example just evaluates the first element of the hits array, the UNNEST example exposes more of the activities. Without understanding what you're after it's kind of an apples to oranges comparison. – shollyman Sep 28 '20 at 16:06
  • @shollyman thanks for commenting. `identical in their purpose` - that's the point. Both queries return the same number of rows & columns though the queries' code differs. I believe I understand what exactly `OFFSET` and `UNNEST` do. The question is why OFFSET performs better – Dmitri Ilin Sep 28 '20 at 16:48
  • As @shollyman mentioned, functions are not identical, beside the fact that we got the same output, they're logic differ (codded in different way). I would suggest to read official [documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unnest) and this [article](https://www.yuichiotsuka.com/google-bigquery-unnest-arrays/). – aga Sep 29 '20 at 14:25
  • @Ines yet again, I do understand that functions are NOT identical. My question is WHY offset works faster and better than unnest. – Dmitri Ilin Oct 02 '20 at 09:17
  • Offered a bounty for a canonical answer putting all my reputation on this. Wonder if someone could provide an explanation – Dmitri Ilin Oct 02 '20 at 09:19

1 Answers1

0

Let's consider following examples with using BigQuery public dataset.

UNNEST - returns 6 results:

WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h

OFFSET - returns 1 result:

WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t

Both queries are referencing to the same record inside a GA public table. They show that using a join with UNNEST will bring one row per element inside the array and using OFFSET(0) will bring only one row with the first element of the array.

The reason for difference in high data shuffling is because the UNNEST performs a JOIN operation, which requires the data to be organized in a specific way. The OFFSET approach takes only the first element of the array.

aga
  • 3,790
  • 3
  • 11
  • 18