10

I am planning to append incremental data on a daily basis to a BigQuery table. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the existing data in the table. One approach would be to:

  1. Collect the set of keys from the incremental data (let's call it INCR_KEYS)
  2. Run a query along the lines of SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS), and store the results in a new table.
  3. Append the incremental data to the new table.

My concern with this approach is that it creates a duplicate copy of a big table and adds to my bill.

Is there a better way of achieving the same without creating a duplicate table?
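
For concreteness, steps 2 and 3 might look roughly like this with the bq tool, with step 1's key collection folded into a subquery (all table and column names here are made up):

# 2. Keep only the rows whose key does not appear in today's incremental
#    batch, writing the result to a new table.
bq query --allow_large_results --destination_table=mydataset.mytable_new \
    'SELECT * FROM mydataset.mytable
     WHERE pkey_col NOT IN (SELECT pkey_col FROM mydataset.incr)'
# 3. Append the incremental data to the new table.
bq cp --append_table mydataset.incr mydataset.mytable_new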

Julius Vainora
user1659408
  • How large is the table? If greater than 64MB compressed, #2 will fail. – Ryan Boyd Sep 11 '12 at 02:59
  • @RyanBoyd - That shouldn't be an issue. If INCR_KEYS>64MB, I could split INCR_KEYS into smaller chunks and repeat step 2 – user1659408 Sep 11 '12 at 03:28
  • The problem is if the full table (ie the data not in INCR_KEYS) is > 64MB. The SELECT query in #2 won't succeed. – Ryan Boyd Sep 11 '12 at 03:34
  • @RyanBoyd - Hmm, checked the developers guide again. Have to figure out an alternative approach then. Thanks for pointing that out! – user1659408 Sep 11 '12 at 13:29
  • Re your R question. I agree we need better channels for you to ask that type of question. We're trying to figure out the best way for you to do that, while making sure that all technical/developer questions appear in SO. In the meantime, feel free to reach out via http://profiles.google.com/ryan.boyd on Google+ and ryguyrg on twitter. – Ryan Boyd Sep 22 '12 at 04:40
  • How did you end up solving this problem? – Derek Perkins Apr 21 '14 at 07:26
  • I suck at stackoverflow, so I'm not sure of the best way to link these related questions, but here is our (BigQuery's) current recommendation on how to do this: http://stackoverflow.com/questions/34165094/return-only-the-newest-rows-from-a-bigquery-table-with-a-duplicate-items – Sean Chen Feb 09 '16 at 07:50

3 Answers

6

I didn't know of a way to do this without creating a duplicate table -- this actually sounds like a pretty clever solution.

The incremental cost to you, however, is likely to be very small -- BigQuery only bills you for data for as long as it exists. If you delete the old table, you'd only need to pay for both tables for a period of seconds or minutes.
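
A minimal sketch of that swap, assuming the deduplicated data has already been written to a hypothetical mydataset.mytable_new:

# Drop the old table and move the new one into its place; the duplicate
# copy only exists (and is only billed) for the duration of these commands.
bq rm -f -t mydataset.mytable
bq cp mydataset.mytable_new mydataset.mytable
bq rm -f -t mydataset.mytable_new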

Jordan Tigani
  • Thanks @Jordan, I can live with that! Would still be great if someone could suggest a solution without involving duplicate tables. – user1659408 Sep 11 '12 at 03:30
1

You could run the query with the destination table set to the existing table and set the write disposition to truncate:

# Rewrite the table in place: --replace truncates the destination, keeping
# only the rows whose key does not appear in the update batch.
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
    'SELECT * FROM mydataset.mytable
     WHERE key NOT IN (SELECT key FROM mydataset.update)'
# Then append the update batch itself.
bq cp --append_table mydataset.update mydataset.mytable

I believe this will work, but I think it is worth taking a backup first, especially as you can delete it soon afterwards.

# Take a backup before rewriting the table in place.
bq cp mydataset.mytable mydataset.backup
# You can also build the new table in one pass; in legacy BigQuery SQL,
# comma-separated tables in the FROM clause behave as a UNION ALL.
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
    'SELECT * FROM (
         SELECT * FROM mydataset.mytable
         WHERE key NOT IN (SELECT key FROM mydataset.update)
     ), (
         SELECT * FROM mydataset.update
     )'
# Once the result looks right, delete the backup.
bq rm mydataset.backup
0

You can set a new destination table and run a query that counts and groups by all of the columns:

SELECT
  FIELD1,
  FIELD2,
  FIELD3,
  FIELD4
FROM (
  SELECT
    COUNT(*),
    FIELD1,
    FIELD2,
    FIELD3,
    FIELD4
  FROM
    [<TABLE>]
  GROUP BY
    FIELD1,
    FIELD2,
    FIELD3,
    FIELD4)
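
For example, setting the destination table from the command line might look like this (names are placeholders); the GROUP BY on its own is what collapses the duplicates, so the count can even be dropped:

# Placeholder names; writes one row per distinct combination of fields.
bq query --allow_large_results --destination_table=mydataset.deduped \
    'SELECT FIELD1, FIELD2, FIELD3, FIELD4
     FROM [mydataset.mytable]
     GROUP BY FIELD1, FIELD2, FIELD3, FIELD4'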
Bruce P