0

I have a partitioned table and am trying to limit my search to a few partitions. To do this I am running a query (using legacy SQL) that looks like the following:

SELECT
  *
FROM
  [project:dataset.table]
WHERE
  _PARTITIONTIME >= "2018-07-10 00:00:00"
  AND _PARTITIONTIME < "2018-07-11 00:00:00"
  AND col IN (
  SELECT
    col
  FROM
    [project:dataset.table]
  WHERE
    _PARTITIONTIME >= "2018-07-10 00:00:00"
    AND _PARTITIONTIME < "2018-07-11 00:00:00"
    AND col2 > 0)

I limit the main query and the subquery using _PARTITIONTIME, so big query should only need to search those partitions. When I run this query though I get billed as if I'd just queried the entire table without using _PARTITIONTIME. Why does this happen?

UPDATE The equivalent query using standard SQL does not have this problem, so use that as a workaround. I'd still like to know why this happens though. If it's just a bug or if legacy SQL actually does attempt to access all the data in a table for a query like this.

hamdog
  • 991
  • 2
  • 10
  • 24

2 Answers2

2

As noted in the question, switching to #standardSQL is the right solution. You shouldn't expect any big updates to the legacy SQL dialect - while #standardSQL will keep getting some substantial ones.

Also note that there are 2 types of partitioned tables today:

  1. Tables partitioned by ingestion time
  2. Tables that are partitioned based on a TIMESTAMP or DATE column

If you try to query the second type with legacy SQL:

SELECT COUNT(*)
FROM [fh-bigquery:wikipedia_v2.pageviews_2018]
WHERE datehour BETWEEN "2018-01-01 00:00:00" AND "2018-01-02 00:00:00" 

you get the error "Querying tables partitioned on a field is not supported in Legacy SQL".

Meanwhile this works:

#standardSQL
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE datehour BETWEEN "2018-01-01 00:00:00" AND "2018-01-02 00:00:00" 

I'm adding these points to enhance the message "it's time to switch to #standardSQL to get the best out of BigQuery".

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
2

I think this is a BigQuery Legacy SQL specific issue.

There is a list of cases for when Pseudo column queries scan all partitions and there is an explicit mentioning of Legacy SQL - In legacy SQL, the _PARTITIONTIME filter works only when ...
I don't see exactly your case in that list - but the best way is just use Standard SQL here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230