
I would like to use the BigQuery streaming insert API to insert rows into a BigQuery table. In my use case, this is the only class of "write" operation that will ever be performed; I will never insert rows by any other means, and I will never update or delete rows.
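For concreteness, here is a minimal sketch of this kind of insert using the Python client library (whose `insert_rows_json` method wraps the tabledata.insertAll API); the table name and schema below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.events"  # hypothetical table

# "key" is the strictly monotonically increasing identifier described below.
rows = [{"key": 42, "payload": "hello"}]

# insert_rows_json calls the tabledata.insertAll streaming API and returns a
# list of per-row errors; an empty list means every row was accepted.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"insertAll failed: {errors}")
```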

Having read "Life of a BigQuery streaming insert", my understanding is that even after an insertAll request has been acked, it is possible that subsequent read operations may not "see" all the rows that were successfully inserted by it until they have been transferred from the streaming buffer into managed storage.

Assuming my understanding is correct (please let me know if not!):

  1. Is there any way to be able to tell when all the rows inserted by a given insertAll request have now been committed to managed storage?
  2. Even after a row has been committed to managed storage, are subsequent read operations guaranteed either to see it, or to fail?

Why am I asking these questions? My table schema will include a "key" column, which is a strictly monotonically increasing identifier. All read operations will be constrained to rows whose keys do not exceed some specified value. I need a guarantee that a read operation of this form will always return the same results, assuming the key specified corresponds to a row that has already been inserted. Clearly, no rows inserted subsequently could possibly be returned by such a query (because their keys would be greater than the key specified). However, if there are any rows that have already been inserted but are still in the streaming buffer, the query might not return them, but the exact same query performed later (after the rows have been committed) would return them. In my use case, this would be a disaster.
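To make the read pattern concrete, here is a minimal sketch of such a query using the Python client library, again assuming the hypothetical table `my_project.my_dataset.events` with an integer `key` column:

```python
from google.cloud import bigquery

client = bigquery.Client()
max_key = 42  # only rows with key <= max_key should ever be returned

query = """
    SELECT key, payload
    FROM `my_project.my_dataset.events`
    WHERE key <= @max_key
    ORDER BY key
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("max_key", "INT64", max_key)]
)

# The guarantee I need: running this twice with the same max_key (once that
# key has been inserted) must return identical result sets.
rows = list(client.query(query, job_config=job_config).result())
```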

1 Answer


A row can be in one of two states: still in the streaming buffer, or committed to managed storage.

You can read rows that are still in the streaming buffer; the syntax for doing so differs between non-partitioned and partitioned tables, and you can find examples of how to retrieve these rows (see the link in the last comment below).

Note that if the same insertId was used for more than one row, only one of those rows is persisted.

The committed state guarantees that reads will see the row.
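As an illustration of the partitioned-table syntax: for an ingestion-time partitioned table, rows still in the streaming buffer sit in the special NULL partition, so they can be selected with a `_PARTITIONTIME IS NULL` filter. A minimal sketch using the Python client library (the table name is hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# For an ingestion-time partitioned table, rows still in the streaming buffer
# have not yet been assigned a partition, so _PARTITIONTIME IS NULL selects
# exactly the buffered rows.
query = """
    SELECT key
    FROM `my_project.my_dataset.events`
    WHERE _PARTITIONTIME IS NULL
"""
buffered_keys = [row.key for row in client.query(query).result()]
```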

Pentium10
  • Thanks! I think you've answered the second part of my question -- i.e. once a row is committed, subsequent (successful) read operations are indeed guaranteed to see it. However, unless there is some way for me to be able to detect that a row has been committed, this doesn't help me much (if there is in fact no way of doing this with BigQuery, then that's a perfectly valid answer of course). – Daniel Patton Dec 16 '18 at 00:21
  • (I've added some more context to the question which hopefully illustrates why I need to be able to tell when rows have been committed.) – Daniel Patton Dec 16 '18 at 00:24
  • What you want is easy to achieve: find the oldest key in your streaming buffer and keep it as a reference. That means all prior keys have been committed, so you can issue a query that uses it as a threshold. – Pentium10 Dec 16 '18 at 12:21
  • That makes sense, but I'm not clear on how I can tell which rows are still in the streaming buffer. Is there some way to determine this using an API call? – Daniel Patton Dec 16 '18 at 13:25
  • Either you check the `streamingBuffer.oldestEntryTime` field via the tables API, which can be leveraged to identify the age of records in the streaming buffer, or you write queries; the syntax differs between non-partitioned and partitioned tables (see the sketch after these comments): https://stackoverflow.com/questions/41864257/how-to-query-for-data-in-streaming-buffer-only-in-bigquery – Pentium10 Dec 16 '18 at 16:30
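Putting the two comments above together, here is a minimal sketch, assuming the Python client library, an ingestion-time partitioned table, and the hypothetical table `my_project.my_dataset.events` with an integer `key` column. Note that the `streamingBuffer` statistics returned by the tables API are estimates:

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.events")  # hypothetical table

# tables.get exposes streamingBuffer statistics; the attribute is None once
# the buffer has been fully flushed to managed storage.
if table.streaming_buffer is None:
    print("No streaming buffer: all inserted rows have been committed.")
else:
    print("Oldest buffered record:", table.streaming_buffer.oldest_entry_time)

    # Per Pentium10's suggestion: find the smallest key still in the buffer
    # and constrain reads to keys strictly below it, so the result set for a
    # given threshold can never change later.
    query = """
        SELECT MIN(key) AS min_buffered_key
        FROM `my_project.my_dataset.events`
        WHERE _PARTITIONTIME IS NULL
    """
    row = next(iter(client.query(query).result()))
    print("Safe read threshold (exclusive):", row.min_buffered_key)
```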