0

GBQ (Google Big Query) provides views for streaming insert meta data, see STREAMING_TIMELINE_BY_*. I would like to use this data to understand the billing for "Streaming Inserts". However, the numbers don't add ab and I'd like to understand if I made a mistake somewhere.


One of the data points in the streaming insert meta data view is the total_input_bytes:

total_input_bytes   

INTEGER 

Total number of bytes from all rows within the 1 minute interval.

In addition, the Pricing for data ingestion says:

Streaming inserts (tabledata.insertAll) 

$0.010 per 200 MB

You are charged for rows that are successfully inserted. Individual rows are calculated using a 1 KB minimum size.

So getting the costs for streamining inserts per day should be possible via

                 0.01/200 * (SUM(total_input_bytes)/1024/1024)

cost per 200 mb -----^

total bytes in mb ---------------------^

This should be the lower boundary as we disregard any rows with less than 1KB that are rounded up to 1KB.

Full query:

SELECT
 project_id,
 dataset_id,
 table_id,
 SUM(total_rows) AS num_rows,
 round(SUM(total_input_bytes)/1024/1024,2) AS num_bytes_in_mb,
 # 0.01$ per 200MB
 # @see https://cloud.google.com/bigquery/pricing#data_ingestion_pricing
 round(0.01*(SUM(total_input_bytes)/1024/1024)/200, 2) AS cost_in_dollar,
 SUM(total_requests) AS num_requests
FROM
 `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
where 
  start_timestamp BETWEEN "2021-04-10" and "2021-04-14"
  AND error_code IS NULL
GROUP BY 1, 2, 3
ORDER BY table_id asc

However, the results are not reflected in our actual billing report. The billing shows less than half the costs of what I would expect:

Now I wondering if the costs can even be calculated like this.

Hirnhamster
  • 7,101
  • 8
  • 43
  • 73

2 Answers2

0

Your query is rounding every line that is less than 0.49kb to 0kb. This should explain why you are calculating less costs.

Try inserting a CASE statement that will handle these values:

SELECT
 project_id,
 dataset_id,
 table_id,
 SUM(total_rows) AS num_rows,
 CASE SUM(total_input_bytes)/1024/1024 < 0.001 THEN 0.001 ELSE
 round(SUM(total_input_bytes)/1024/1024,2) END AS num_bytes_in_mb,
 # 0.01$ per 200MB
 # @see https://cloud.google.com/bigquery/pricing#data_ingestion_pricing
CASE SUM(total_input_bytes)/1024/1024 < 0.001 THEN 0.001 ELSE
 round(0.01*(SUM(total_input_bytes)/1024/1024)/200, 2) END AS cost_in_dollar,
 SUM(total_requests) AS num_requests
FROM
 `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
where 
  start_timestamp BETWEEN "2021-04-10" and "2021-04-14"
  AND error IS NULL
GROUP BY 1, 2, 3
ORDER BY table_id asc
kelyen
  • 202
  • 1
  • 8
  • It's the other way around: I get "too much" costs from the query - I should get less according to the billing report. The rounding is also applied AFTER the sum - so afaik it shouldn't affect the outcome (unless the sum would be less than 0.001) – Hirnhamster Apr 23 '21 at 12:09
0

I'm surprised by this. In fact I think cost from your query should be significantly lesser than the actual cost.

Consider the following 3 rows inserted via streaming in a 1 min window: Sample BigQuery Streaming data in a window

Your query will give us cost from the first two columns but the 3rd column should always be higher.

Were you able to figure this out?