2

I want to stream some time series data into BigQuery with insertAll, but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table, but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has the bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of is to run a secure daily cron job to create the tables -- not ideal, especially since if it ever misfires, data will be dropped until the table is created.

Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges, so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I could partition old data into daily tables without losing rows that are being streamed at the time.
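To make the decorator idea concrete, here is roughly the kind of query I have in mind (a rough sketch only; the project/dataset/table names are placeholders, and 259200000 ms is 3 days):

```python
# Rough sketch: a range-decorated legacy-SQL query that only scans data
# added to the table in roughly the last 3 days. The trailing "-" in the
# decorator means "up to now". Names are placeholders.
import httplib2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bq = discovery.build('bigquery', 'v2',
                     http=credentials.authorize(httplib2.Http()))

body = {
    "query": ("SELECT ts, value "
              "FROM [my_dataset.events@-259200000-] "
              "LIMIT 10"),
    "useLegacySql": True,
}
result = bq.jobs().query(projectId="my-project", body=body).execute()
for row in result.get("rows", []):
    print(row)
```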

Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.

Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.

Piotr
  • You've already solved it by partitioning. If table creation is an issue, have an hourly cron that verifies that today's and tomorrow's tables are always created. – Zig Mandel Nov 08 '15 at 01:40
  • I was hoping to do it as a serverless solution. If I go with a cron job, I need a server to run it, and monitoring to make sure it didn't die or else I'll lose data. All feasible but overhead I'd rather avoid if possible. – Piotr Nov 08 '15 at 02:04
  • Not so if you use App Engine cron. – Zig Mandel Nov 08 '15 at 02:42

3 Answers

3

If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
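From the discovery document alone, the usage appears to be: add a templateSuffix to the tabledata.insertAll request body, and BigQuery streams into (and, if necessary, creates) a table named after the base table plus the suffix, using the base table's schema. A rough, untested sketch with placeholder names:

```python
# Rough, untested sketch based only on the discovery document; project,
# dataset, table and field names are placeholders. The base table "events"
# must already exist; rows end up in "events_20151108", created on demand
# from the base table's schema.
import httplib2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bq = discovery.build('bigquery', 'v2',
                     http=credentials.authorize(httplib2.Http()))

body = {
    "templateSuffix": "_20151108",
    "rows": [
        {"insertId": "row-0001",
         "json": {"ts": "2015-11-08 01:40:00", "value": 42.0}},
    ],
}
bq.tabledata().insertAll(projectId="my-project",
                         datasetId="my_dataset",
                         tableId="events",
                         body=body).execute()
```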

I'd also point out that no official documentation has been released, so special care should probably go into using this field, especially in a production setting. Experimental fields could have bugs, etc. Off the top of my head, things to be careful of include:

  • Modifying the schema of the base table in non-backwards-compatible ways.
  • Modifying the schema of a created table directly in a way that is incompatible with the base table.
  • Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
  • Performing operations on the created table while it's actively being streamed to.

And I'm sure there are other things. Anyway, just thought I'd point it out; the official documentation will no doubt be much more thorough.

Sean Chen
2

Most of us are doing the same thing as you described.

But we don't use a cron: we create tables a year in advance (on some projects, five years in advance). You may wonder why, and when, we do this.

We do this whenever we, the developers, change the schema. On deploy we run a script that applies the schema changes to the old/existing tables, deletes all the still-empty tables created for future dates, and simply recreates them. We didn't complicate our lives with a cron, because we know the exact moment the schema changes (the deploy), and there is no disadvantage to creating tables that far in advance. On SaaS systems we also do this per tenant, when an account is created or closed.

This way we don't need a cron; we just need to remember that the deploy has to perform this additional step whenever the schema changes.
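This is not our actual deploy script, but a minimal sketch of the pre-creation step, assuming daily tables named events_YYYYMMDD and a placeholder project, dataset and schema:

```python
# Minimal sketch of pre-creating daily tables a year in advance.
# Project/dataset/table names and the schema are placeholders.
import datetime
import httplib2
from googleapiclient import discovery
from googleapiclient.errors import HttpError
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bq = discovery.build('bigquery', 'v2',
                     http=credentials.authorize(httplib2.Http()))

SCHEMA = {"fields": [{"name": "ts", "type": "TIMESTAMP"},
                     {"name": "value", "type": "FLOAT"}]}

today = datetime.date.today()
for offset in range(366):
    day = today + datetime.timedelta(days=offset)
    body = {
        "tableReference": {"projectId": "my-project",
                           "datasetId": "my_dataset",
                           "tableId": "events_" + day.strftime("%Y%m%d")},
        "schema": SCHEMA,
    }
    try:
        bq.tables().insert(projectId="my-project",
                           datasetId="my_dataset",
                           body=body).execute()
    except HttpError as err:
        if err.resp.status != 409:   # 409 means the table already exists
            raise
```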

As for not losing streaming inserts while you do maintenance on your tables, that needs to be addressed in your business logic at the application level. You probably already have some sort of message queue, like Beanstalkd, that queues all the rows into a tube, with a worker that later pushes them to BigQuery. You may already have this in place to handle the case where the BigQuery API responds with an error and you need to retry; that's easy to do with a simple message queue. You would rely on that same retry phase whenever you stop or rename a table for a while: the streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.

If you don't have this retry phase, you should consider adding it: it not only lets you retry failed BigQuery calls, it also gives you a maintenance window.
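A minimal sketch of such a worker, assuming a Beanstalkd tube of JSON-encoded rows and the beanstalkc client (tube, project, dataset and table names are placeholders; error handling is simplified):

```python
# Minimal sketch, not production code: pop rows from a Beanstalkd tube and
# stream them to BigQuery, releasing a job back onto the tube (with a delay)
# whenever the insert fails, e.g. while the table is renamed for ETL work.
import json
import beanstalkc
import httplib2
from googleapiclient import discovery
from googleapiclient.errors import HttpError
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bq = discovery.build('bigquery', 'v2',
                     http=credentials.authorize(httplib2.Http()))

queue = beanstalkc.Connection(host='localhost', port=11300)
queue.watch('bq-rows')

while True:
    job = queue.reserve()          # blocks until a row is available
    row = json.loads(job.body)
    try:
        result = bq.tabledata().insertAll(
            projectId="my-project", datasetId="my_dataset", tableId="events",
            body={"rows": [{"json": row}]}).execute()
        if result.get("insertErrors"):
            job.release(delay=60)  # per-row errors: try again later
        else:
            job.delete()           # streamed successfully
    except HttpError:
        job.release(delay=60)      # API error (e.g. table missing): retry later
```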

Pentium10
0

You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created. It's very likely the App Engine app won't go over the free quotas, and it has a 99.95% uptime SLO, so the cron will effectively never go down.

Zig Mandel