I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).
The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only so I have to filter out old copies via the ROW_NUMBER() OVER
line you'll see below, so that's why it exists. Seems like the standard way to deal with this quirk.)
The wrinkle with the companies
table is every single field, except for two ID ones, is of type RECORD
. (See the screenshot at the bottom for an example). It serves to keep a history of field value changes. Unfortunately they don't seem to be in any order so wrapping up the fields - properties.first_conversion_event_name
for example - in a MIN()
or MAX()
and grouping by companyid formula doesn't work.
This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):
WITH companies AS (
SELECT
o.companyid as companyid,
ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)
SELECT
companyid,
name.value as name,
industry.value as industry,
lifecyclestage.value as lifecyclestage
FROM companies
The WITH
clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT())
includes. For each field I would have two columns - [field].value
and [field].timestamp
- and I only want the [field].value
one.
Thanks in advance!