7

I have package data with some of its fields as follows:

packageid --> string
status --> string
status_type --> string
scans --> record (repeated)
     scanid --> string
     status --> string
scannedby --> string
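(For reference, roughly the same schema expressed as a BigQuery standard-SQL table definition; the dataset and table names are just placeholders:)

-- Rough sketch of the package schema above; mydataset.packages is a placeholder name.
CREATE TABLE mydataset.packages (
  packageid   STRING,
  status      STRING,
  status_type STRING,
  scans       ARRAY<STRUCT<scanid STRING, status STRING>>,
  scannedby   STRING
);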

Per day, I have data for 100,000 packages. The total package data size per day is about 100 MB, and for one month it becomes about 3 GB. For each package, 3-4 updates can come. So do I have to overwrite the package table every time a package update (e.g. just a change in the status field) comes?

Suppose I have data for 3 packages in the table and an update for the 2nd package comes; do I have to overwrite the whole table (deleting and re-adding the whole data takes 2 transactions per package update)? For 100,000 packages, the total transactions would be 10^5 * 10^5 * 2/2.

Is there any other approach for atomic updates without overwriting the table? (If the table contains 1 million entries and a package update comes, overwriting the whole table is a big overhead.)

hmims
  • Are you aware that BQ is append-only by design? Does this change your approach? Isn't your data event based, e.g. an update is actually a new event? – Pentium10 Jan 25 '16 at 13:11
  • Hi Pentium, my data is event based. For a package id, say 140654070001, I have status as pending, and then in the next event update the status is changed to complete. Now for this I have a new row, with the same data and just one field changed. So whenever any update comes for an already existing package, I am removing the duplicates. Are you saying that I should keep on appending without duplicate removal? That seems fine to me too, but then my concern shifts to data storage. There will be a lot of redundancy. – hmims Jan 25 '16 at 13:41
  • We have the same thing, and we keep all the versions over an entity's lifetime. Storage costs are low. Or is it high for you? – Pentium10 Jan 25 '16 at 15:23

3 Answers

10

Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of when the row data was updated, you can simply add every update as a new row and apply a view over the table to give you the desired rows.

Your view would look something like this:

SELECT *
FROM (
  SELECT
      *,
      MAX(<timestamp_column>)
          OVER (PARTITION BY <id_column>)
          AS max_timestamp
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp

(cribbed from here: Return only the newest rows from a BigQuery table with duplicate items)
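For example, with the schema from the question, assuming you add an update timestamp column (called last_updated here) and load everything into a table mydataset.packages_raw (both names are placeholders), the view body might look like this (repeated fields such as scans may need extra handling):

SELECT *
FROM (
  SELECT
      *,
      MAX(last_updated)
          OVER (PARTITION BY packageid)
          AS max_timestamp
  FROM mydataset.packages_raw
)
WHERE last_updated = max_timestamp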

If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.

e.g.

  • Add Data to TABLE_RAW.
  • Create a view TABLE that performs the above query over TABLE_RAW.
  • At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.

Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
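As a concrete sketch of that last materialization step: at the time, you would run the view query as a query job with a destination table and write disposition WRITE_TRUNCATE; with the DDL that BigQuery added later, the same truncate-and-rewrite can also be expressed directly in SQL (all table and column names below are placeholders):

-- Rebuild the stable table from the raw data, keeping only the newest row
-- per packageid. CREATE OR REPLACE overwrites the destination, which has the
-- same effect as WRITE_TRUNCATE. Table and column names are placeholders.
CREATE OR REPLACE TABLE mydataset.packages AS
SELECT * EXCEPT(max_timestamp)
FROM (
  SELECT
      *,
      MAX(last_updated) OVER (PARTITION BY packageid) AS max_timestamp
  FROM mydataset.packages_raw
)
WHERE last_updated = max_timestamp;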

Sean Chen
  • In this case, what would be more performant: the query from your answer, or this: https://gist.github.com/anonymous/76458c2f55a713c79466fc889e806579 – devnull Feb 22 '17 at 18:51
3

You cannot update a row in a BigQuery table; you can only append. Overwriting the table on each and every transaction doesn't make sense at all from any perspective.

I would suggest just adding each and every transaction as a new row.

Meanwhile, if for any reason (storage cost, query cost, query performance, etc.) you want to dedup, you can do a batch dedup periodically, let's say daily. In this case, having the original data partitioned into daily tables will be beneficial, as at each moment you will only need the latest deduped table and the recent daily table to query the latest transactions. Previous days' daily tables can be deleted if you are worried about storage cost.
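A sketch of such a daily dedup pass, assuming a last_updated timestamp column and date-suffixed tables (all table and column names are placeholders):

-- Build today's deduped table from yesterday's deduped table plus today's
-- daily table, keeping only the latest version of each packageid.
-- Table and column names are placeholders.
CREATE TABLE mydataset.packages_deduped_20160126 AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY packageid ORDER BY last_updated DESC) AS rn
  FROM (
    SELECT * FROM mydataset.packages_deduped_20160125
    UNION ALL
    SELECT * FROM mydataset.packages_daily_20160126
  )
)
WHERE rn = 1;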

Mikhail Berlyant
0

BigQuery supports updates (DML) now, and it supports multi-statement transactions as well.
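For example, a single-row status change like the one described in the question can now be done with a DML statement (the table name is a placeholder):

-- Update one package's status in place with BigQuery DML.
-- mydataset.packages is a placeholder table name.
UPDATE mydataset.packages
SET status = 'complete'
WHERE packageid = '140654070001';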

Firas Omrane