BigQuery SQL Query Optimization

Question

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).

The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only so I have to filter out old copies via the ROW_NUMBER() OVER line you'll see below, so that's why it exists. Seems like the standard way to deal with this quirk.)

The wrinkle with the companies table is every single field, except for two ID ones, is of type RECORD. (See the screenshot at the bottom for an example). It serves to keep a history of field value changes. Unfortunately they don't seem to be in any order so wrapping up the fields - properties.first_conversion_event_name for example - in a MIN() or MAX() and grouping by companyid formula doesn't work.

This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):

WITH companies AS (
SELECT
    o.companyid as companyid,
    ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
    ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
    ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
  FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)

SELECT
  companyid,
  name.value as name,
  industry.value as industry,
  lifecyclestage.value as lifecyclestage
FROM companies

The WITH clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT()) includes. For each field I would have two columns - [field].value and [field].timestamp - and I only want the [field].value one.

Thanks in advance!

Schema Screenshot

score 0 · Answer 1 · answered Feb 21 '19 at 05:37

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).

Based on schema you presented and assuming your query really returns what you expect - below "optimized" version should return same result

#standardSQL
WITH companies AS (
  SELECT
      o.companyid AS companyid,
      STRUCT(o.properties.name.value, o.properties.name.timestamp) AS name,
      STRUCT(o.properties.industry.value, o.properties.industry.timestamp) AS industry,
      STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) AS lifecyclestage
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) AS seqnum
    FROM `project.hubspot.companies` o
  ) o
  WHERE seqnum = 1
)
SELECT
  companyid,
  name.value AS name,
  industry.value AS industry,
  lifecyclestage.value AS lifecyclestage
FROM companies

As you can see I just simply removed GROUP BY companyid because you already have just one entry/row per companyid after you apply WHERE seqnum = 1, so no reason at all to group just one row per companyid. For the very same reason I removed ARRAY_AGG( ORDER BY)[SAFE_OFFSET(0)] - it just aggregated one struct and then extracted that one element out of array - so no need in doing this

BigQuery SQL Query Optimization

1 Answers1