0

BigQuery's documentation on cost optimization states that:

BigQuery can provide incredible performance because it stores data as a columnar data structure. This means SELECT * is the most expensive way to query data. This is because it will perform a full query scan across every column present in the table(s), including the ones you might not need.

However I could not find anywhere in the documentation that discuss whether the BigQuery engine optimizes subqueries/virtual views that use SELECT * or not. Examples:

(1)

CREATE VIEW my_view AS ( SELECT * FROM my_table );

SELECT a, b FROM my_view -- does this only access {a, b} or all columns?

(2)

SELECT a, b FROM (SELECT * FROM my_table) AS t -- does this only access {a, b} or all columns?

So the question is: Does BigQuery optimize subquery/view SELECT * to minimize costs, and if so, does it do it reliably?

Philippe Hebert
  • 1,616
  • 2
  • 24
  • 51

2 Answers2

2

EDIT: Confirmed by Yun Zhang's answer.

It seems from my fiddling around in the BigQuery editor in the GCP console that the query plan execution details reflect the final needs of the top level query rather than the SELECT * in a subquery.

Adding a field to the top-level SELECT statement:

  • changes the marker at the top-right which estimates how much data will be processed
  • results in a different execution detail breakdown where the input statement related to the table read in the subquery is adjusted according to which field have been queried at the top level.

For instance, if I query

SELECT a, b FROM (SELECT * FROM my_table) AS t

then the execution plan will write something akin to:

S00: Input
READ $10:a, $11:b
FROM my_table

whereas

SELECT a, b, c FROM (SELECT * FROM my_table) AS t

will result in the following:

S00: Input
READ $10:a, $11:b, $12:c
FROM my_table

Thereby giving weak proof that BQ optimizes fields read by inner SELECT * statements based on the top level fields requested.

Philippe Hebert
  • 1,616
  • 2
  • 24
  • 51
1

Does BigQuery optimize subquery/view SELECT * to minimize costs,

Yes. And you don't need to guess. Query below uses public dataset that you'll see same stats from your project.

Full table query:

This query will process 280.6 MB when run.

enter image description here

SubQuery (view will be the same)

This query will process 12.6 MB when run.

enter image description here

and if so, does it do it reliably?

Yes. This is guaranteed.

Yun Zhang
  • 5,185
  • 2
  • 10
  • 29
  • Thanks for your answer! :) I see from doing some research about you that you are in the core Bigquery team. Can you add this as part of your answer? Cheers! – Philippe Hebert Jan 12 '21 at 00:32