Merge two losely related datasets in BigQuery

Question

I am new to BigQuery, and would appreciate your help. I am trying to keep my case generic so that it can be useful to others. I'm sure it's pretty basic and I am missing something...

I have 2 datasets, "Master" and "New". Master contains several columns, and already has lots of rows. I want to merge a new dataset into the master list, but the entries from this new dataset may or may not have values in all the columns. In addition, the values in the new list may or may not be already present in the master list. There is one unique / primary key in both (e.g. Col1) For example:

Master:

Col1 | Col2 | Col3 | Col4

Val11 | Val12 | Val13 | Val14

Val21 | Val22 | Val23 | Val24

Val31 | Val32 | Val33 | Val34

New:

Col1 | Col5 | Col6

Val11 | Val15 | null

Val41 | Val45 | Val46

Here are my questions:

1) what bigquery would I have to write in order to merge the two datasets, with the following requirements: a) values should be merged in records, e.g. when the previous record in master had 'null' somewhere, and the new dataset has a value for it, then it should appear there. b) if there is a 'conflict' then the old value should be overwritten by the new value

This is basically just a full outer join, I think, but BigQuery only supports Inner and Left Outer joins...

2) in case it is possible, using some more nested logic, I would like to be able to do 2b) in a more refined way, such as "if there is a conflict, and the original value is X, then overwrite, else don't overwrite/keep the origial value for that one column (other values for that record would be unaffected).

3) in case it is possible, using some more nested logic, I would like to be able to say something like "if val34 and val36 are both TRUE, then set val38 to TRUE as well, else if val37 is X then set val38 to X"

Now, these are 3 different but related questions, if you can help me on either, it would be really great. Btw, I am using bigquery, because the datasets are too large for a normal DB to handle, and because it is a very convenient tool that I can easily share within the entire team.

THANKS A LOT for your help and pointers!

p.s. here is the BigQuery Documentation https://developers.google.com/bigquery/query-reference

score 2 · Answer 1 · edited May 20 '14 at 10:36

1) I've filed a bug to support FULL OUTER JOIN. AFAIK it should be an easy change, but more investigation is necessary. If you don't want to wait, you can run three queries to get the left side, the right side, and the inner join. (you can append the results together, or you could create an outer query that wrapped the others as unioned tables to do it in a single pass).

1a:

SELECT col1, col2, col3 
FROM [tables.master]
WHERE col1 NOT IN (
  SELECT col1 FROM [tables.new])

1b:

SELECT col1, col5, col6
FROM [tables.new]
WHERE col1 NOT IN (
  SELECT col1 FROM [tables.master])

1c:

SELECT master.col1 as col1,     
  IF (new.col2 is not NULL, new.col2, master.col2) as col2, 
  IF (master.colI == X, new.colI, master.colI) as colI, 
  IF (master.colK AND master.colL, new.colJ, master.colJ) as colJ, 
FROM [tables.master] as master
INNER JOIN [tables.new] as new
ON master.col1 = new.col1

I think that the computation of col2 satisfies 1), colI satisfies 2), and colJ satisfies 3. If not, I may not understand what you're asking, and if you could clarify that would be great.

When/if FULL OUTER JOIN is implemented, you can get rid of the first two queries and change the third to a FULL OUTER instead of INNER JOIN.

good news: FULL OUTER JOINS are now supported! https://cloud.google.com/bigquery/release-notes#04162015 — Felipe Hoffa, Apr 27 '15 at 21:14

Merge two losely related datasets in BigQuery

1 Answers1