10

I am planning to append incremental data on a daily basis to a BigQuery table. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the existing data in the table. One approach would be to:

  1. Collect the set of keys from the incremental data (let's call it INCR_KEYS)
  2. Run a query along the lines of SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS), and store the results in a new table.
  3. Append the incremental data to the new table.

My concern with this approach is that it creates a duplicate copy of a big table and adds to my bill.

Is there a better way of achieving the same without creating a duplicate table?
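
For concreteness, steps 2 and 3 might look roughly like this with the bq tool, with step 1's key collection folded into a subquery (all table and column names here are made up):

# 2. Keep only the rows whose key does not appear in today's incremental
#    batch, writing the result to a new table.
bq query --allow_large_results --destination_table=mydataset.mytable_new \
    'SELECT * FROM mydataset.mytable
     WHERE pkey_col NOT IN (SELECT pkey_col FROM mydataset.incr)'
# 3. Append the incremental data to the new table.
bq cp --append_table mydataset.incr mydataset.mytable_new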

Julius Vainora
user1659408
  • How large is the table? If greater than 64MB compressed, #2 will fail. – Ryan Boyd Sep 11 '12 at 02:59
  • @RyanBoyd - That shouldn't be an issue. If INCR_KEYS>64MB, I could split INCR_KEYS into smaller chunks and repeat step 2 – user1659408 Sep 11 '12 at 03:28
  • The problem is if the full table (ie the data not in INCR_KEYS) is > 64MB. The SELECT query in #2 won't succeed. – Ryan Boyd Sep 11 '12 at 03:34
  • @RyanBoyd - Hmm, checked the developers guide again. Have to figure out an alternative approach then. Thanks for pointing that out! – user1659408 Sep 11 '12 at 13:29
  • Re your R question. I agree we need better channels for you to ask that type of question. We're trying to figure out the best way for you to do that, while making sure that all technical/developer questions appear in SO. In the meantime, feel free to reach out via http://profiles.google.com/ryan.boyd on Google+ and ryguyrg on twitter. – Ryan Boyd Sep 22 '12 at 04:40
  • How did you end up solving this problem? – Derek Perkins Apr 21 '14 at 07:26
  • I suck at stackoverflow, so I'm not sure of the best way to link these related questions, but here is our (BigQuery's) current recommendation on how to do this: http://stackoverflow.com/questions/34165094/return-only-the-newest-rows-from-a-bigquery-table-with-a-duplicate-items – Sean Chen Feb 09 '16 at 07:50

3 Answers

6

I didn't know of a way to do this without creating a duplicate table -- this actually sounds like a pretty clever solution.

The incremental cost to you, however, is likely to be very small -- BigQuery only bills you for data for as long as it exists. If you delete the old table, you'd only need to pay for both tables for a period of seconds or minutes.
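
A minimal sketch of that swap, assuming the deduplicated data has already been written to a hypothetical mydataset.mytable_new:

# Drop the old table and move the new one into its place; the duplicate
# copy only exists (and is only billed) for the duration of these commands.
bq rm -f -t mydataset.mytable
bq cp mydataset.mytable_new mydataset.mytable
bq rm -f -t mydataset.mytable_new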

Jordan Tigani
  • Thanks @Jordan, I can live with that! Would still be great if someone could suggest a solution without involving duplicate tables. – user1659408 Sep 11 '12 at 03:30
1

You could run the query with the destination table set to the existing table and set the write disposition to truncate:

# Rewrite the table in place: --replace truncates the destination, keeping
# only the rows whose key does not appear in the update batch.
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
    'SELECT * FROM mydataset.mytable
     WHERE key NOT IN (SELECT key FROM mydataset.update)'
# Then append the update batch itself.
bq cp --append_table mydataset.update mydataset.mytable

I believe this will work, but I think it is worth taking a backup first, especially as you can delete it soon afterwards.

# Take a backup before rewriting the table in place.
bq cp mydataset.mytable mydataset.backup
# You can also build the new table in one pass; in legacy BigQuery SQL,
# comma-separated tables in the FROM clause behave as a UNION ALL.
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
    'SELECT * FROM (
         SELECT * FROM mydataset.mytable
         WHERE key NOT IN (SELECT key FROM mydataset.update)
     ), (
         SELECT * FROM mydataset.update
     )'
# Once the result looks right, delete the backup.
bq rm mydataset.backup
0

You can set a new destination table and run a query that counts and groups by all of the columns:

SELECT
  FIELD1,
  FIELD2,
  FIELD3,
  FIELD4
FROM (
  SELECT
    COUNT(*),
    FIELD1,
    FIELD2,
    FIELD3,
    FIELD4
  FROM
    [<TABLE>]
  GROUP BY
    FIELD1,
    FIELD2,
    FIELD3,
    FIELD4)
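
For example, setting the destination table from the command line might look like this (names are placeholders); the GROUP BY on its own is what collapses the duplicates, so the count can even be dropped:

# Placeholder names; writes one row per distinct combination of fields.
bq query --allow_large_results --destination_table=mydataset.deduped \
    'SELECT FIELD1, FIELD2, FIELD3, FIELD4
     FROM [mydataset.mytable]
     GROUP BY FIELD1, FIELD2, FIELD3, FIELD4'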
Bruce P