1

I have a Google BigQuery table that looks like this:

║ id ║   col_1    ║  col_2  ║ updated ║

║  1 ║ first_data ║ null    ║ 4/22    ║

║  1 ║ null       ║ old     ║ 4/23    ║

║  1 ║ null       ║ correct ║ 4/24    ║

I would like to construct a query that combines these rows and "overwrites" null columns if there is a row with the same id with the column not null. Essentially the result should look like:

║  1 ║ first_data ║ correct ║ 4/24    ║

If possible I would also like the result to represent a history:

║  1 ║ first_data ║ old     ║ 4/23    ║

║  1 ║ first_data ║ correct ║ 4/24    ║

But that is secondary and not necessary.

alamoot
  • 1,966
  • 7
  • 30
  • 50
Ben Reid
  • 35
  • 2
  • 5

2 Answers2

5

Below is for BigQuery Standard SQL

#standardSQL
SELECT id, 
  IFNULL(col_1, FIRST_VALUE(col_1 IGNORE NULLS) OVER(win)) col_1, 
  IFNULL(col_2, FIRST_VALUE(col_2 IGNORE NULLS) OVER(win)) col_2, 
  updated
FROM `project.dataset.your_table`
WINDOW win AS (PARTITION BY id ORDER BY updated DESC 
               ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
-- ORDER BY id, updated

You can test / play with it using dummy data as below

#standardSQL
WITH `project.dataset.your_table` AS (
  SELECT 1 id, 'first_data' col_1, NULL col_2,  '4/22' updated UNION ALL
  SELECT 1,     NULL,             'old',        '4/23'         UNION ALL
  SELECT 1,     NULL,             'correct',    '4/24'         UNION ALL
  SELECT 1,    'next_data',       NULL,         '4/25'         UNION ALL
  SELECT 1,     NULL,             NULL,         '4/26'         
)
SELECT id, 
  IFNULL(col_1, FIRST_VALUE(col_1 IGNORE NULLS) OVER(win)) col_1, 
  IFNULL(col_2, FIRST_VALUE(col_2 IGNORE NULLS) OVER(win)) col_2, 
  updated
FROM `project.dataset.your_table`
WINDOW win AS (PARTITION BY id ORDER BY updated DESC 
               ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
ORDER BY id, updated

with result

Row id  col_1       col_2   updated  
1   1   first_data  null    4/22     
2   1   first_data  old     4/23     
3   1   first_data  correct 4/24     
4   1   next_data   correct 4/25     
5   1   next_data   correct 4/26     
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • May I ask how to adapt this so no history is recorded e.g. only row 5 in the example above? – MarkeD Nov 27 '18 at 14:01
  • I ended up adding `ROW_NUMBER() OVER (PARTITION BY id) as n` and then filtered by `WHERE n=1` – MarkeD Nov 27 '18 at 14:28
  • @MarkeD - ROW_NUMBER() is quite a memory hog! in most cases can end up with resources exceeded error. if you can post your question - i will provide alternative solution (obviously format of comments do not allow to do so) – Mikhail Berlyant Nov 27 '18 at 15:59
  • Ok cool thanks I posted it here https://stackoverflow.com/questions/53508405/combine-rows-in-bigquery-standard-sql-ignoring-nulls-and-filter-to-fullest-row-i – MarkeD Nov 27 '18 at 21:27
0

How does this work?

select a.id, max(c.col_1)col_1, a.col_2,  a.updated  from yourtable a 
 left join (
    select id, col_1, max(updated) updated from yourtable
    where col_1 is not null
    group by id, col_1)b
on a.id=b.id and a.updated=b.updated
left join  yourtable c on a.id=c.id and a.updated<>c.updated
where a.col_2 is not null and c.col_1 is not null
group by a.id,   a.col_2,  a.updated 
order by updated  
Daniel Marcus
  • 2,686
  • 1
  • 7
  • 13