
I'm trying to query my dataset by timestamp, but I'm billed for a full dataset scan. Is it possible to query only a specific timestamp range and reduce the amount of data analyzed?

Note that I'm using a Stackdriver sink that streams data in real time to my dataset. I don't have any control over the schema.

SELECT
  REGEXP_EXTRACT(httpRequest.requestUrl, r"^.+cid=([0-9]*)") as cid,
  REGEXP_EXTRACT(httpRequest.requestUrl, r"^.+pid=([0-9]*)") as pid
FROM
   `dataset_20190305` -- Date partition - no time
WHERE timestamp >= TIMESTAMP('2019-03-05 16:50:00')
Riccardo
  • Is dataset_20190305 a table with a partition field or a custom partition field? Can you provide more details about your BigQuery setup? – Tamir Klein Mar 05 '19 at 18:03
  • @TamirKlein It's the automatic export from Stackdriver logs for the HTTP load balancer. Dataset - Requests_* (date partition) – Riccardo Mar 06 '19 at 16:33

1 Answer


There are only two ways to reduce BigQuery query cost using only the WHERE clause: partitioning and clustering. Judging by your table name, I suspect it only contains data for March 5th. Unfortunately, BQ only supports partitioning at the day level, so it won't help your case — there is no way to partition by any time block smaller than a day. That said, your data may be cleaner if you simply create one table partitioned by day instead of creating a new table each day.
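As a sketch of that consolidation (the project, dataset, and table names here are placeholders, not from your setup):

```sql
-- Illustrative only: one day-partitioned table instead of a table per day.
-- Queries that filter on the timestamp then scan only the matching partitions.
CREATE TABLE `my_project.my_dataset.requests`
PARTITION BY DATE(timestamp)
AS
SELECT *
FROM `my_project.my_dataset.dataset_20190305`;
```

A later query with `WHERE DATE(timestamp) = '2019-03-05'` would then be billed only for that day's partition, though anything finer-grained than a day still scans the whole partition.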

The other option, clustering, sorts the table's storage by the columns you specify, so filters on those columns can skip blocks of data. For example, if you had a field "Color" of type String whose values were "Red", "Green", and "Blue", clustering on it and filtering for a single color could reduce the data scanned to roughly a third.
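A minimal sketch of such a table, assuming a hypothetical `color` column (names are illustrative):

```sql
-- Illustrative only: day-partitioned and additionally clustered by "color",
-- so a query with WHERE color = 'Red' can prune storage blocks.
CREATE TABLE `my_project.my_dataset.items`
(
  timestamp TIMESTAMP,
  color     STRING,
  payload   STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY color;
```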

If you're just looking to reduce cost and find yourself repeating a similar query against a specific table a lot, you can run it once and copy the results to a new table, which effectively limits subsequent queries to the specified time range.
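Applied to the query in the question, that would look something like this (the destination table name is made up for the example):

```sql
-- Illustrative only: pay for the full scan once, then query the small copy.
CREATE TABLE `my_project.my_dataset.requests_20190305_1650`
AS
SELECT
  REGEXP_EXTRACT(httpRequest.requestUrl, r"^.+cid=([0-9]*)") AS cid,
  REGEXP_EXTRACT(httpRequest.requestUrl, r"^.+pid=([0-9]*)") AS pid
FROM `dataset_20190305`
WHERE timestamp >= TIMESTAMP('2019-03-05 16:50:00');
```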

That being said, in general, try to follow BQ's best practices, such as selecting only the columns you're interested in.

Scicrazed
  • Dataset partitions are created automatically by the Stackdriver export, so I can't change them. Actually I have only one dataset with a date partition, containing all HTTP request logs from the load balancer. As I need specific stats at least every 5 minutes, I think it's not the best solution. I will move the export to Pub/Sub and then reduce my data to a specific table and partition. – Riccardo Mar 06 '19 at 15:54
  • I think that's a great idea! Pushing your data to a Timestamp partitioned table will be a lot more organized and should keep your query costs relatively low. – Scicrazed Mar 06 '19 at 21:10
  • Unfortunately even timestamp partitioning works only at the day level. We tried to reduce the data with Pub/Sub and Dataflow, partitioning by day and clustering columns on Hour, Minute, Seconds. The query cost and data scanned decreased, but it's still unpredictable. With 1 GB of data the clustering worked only on the first column (Hour), but not on H,m or H,m,s. Clustering is still in beta and still works better without too many inserts. In this case we were using a table with streamed data. – Riccardo May 07 '19 at 14:19