Why Google BigQuery doesn't use partition date correctly when using views

Question

I have a date partitioned table (call it sample_table) with 2 columns, one to save dateTime in UTC and other to save timezone offset. I have a view on top of this table (call it sample_view). The view takes _partitiontime in from table and exposes that as partitionDate column and also there is another column customerDateTime which simply adds dateTime with timeZoneOffset.

When I query the sample_table directly using only _partitiontime bigquery scans far less data (131 MB).

select
  containerName,
  count(*)
from
  [sample_project.sample_table] 
where
  _partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

But when I run same query on the table with dateTime column to scan according to customer's local date time big query scans more (211MB). I expected less than 131MB or equal to 131MB.

select
  containerName,
  count(*)
from
  [sample_project.sample_table] 
where
  _partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
  and DATE_ADD(dateTime, 3600, 'SECOND' ) between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

When I run similar query against the sample_view with partitionDate bigquery scans more (399MB)

select
  containerName,
  count(*)
from
  [sample_project.sample_view] 
where
  partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

And when I run query against the view with partitionDate and use customerDateTime column as well bigquery scans even more (879MB)

select
  containerName,
  count(*)
from
  [sample_project.sample_view] 
where
  partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19') and customerDateTime between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
  and customer = 'X'
  and containerName = 'XXX'
group by containerName
;

I'm not too sure whether I'm scanning right partitions from any of the queries above. Why do I see the differences between these queries? Is exposing _partitiontime as a new column partitionDate a bad strategy? I'm not sure how else to use the partition date within Tableau without writing more queries. Please let me know if you require more details.

does your table have repeated fields? – Mikhail Berlyant Dec 19 '16 at 19:45 — Mikhail Berlyant, Dec 19 '16 at 19:45
@MikhailBerlyant Yes it does. – opensourcegeek Dec 20 '16 at 07:02 — opensourcegeek, Dec 20 '16 at 07:02

score 0 · Answer 1 · answered Dec 19 '16 at 18:47

0

You will probably need to use standard SQL for the query instead, since legacy SQL has some limitations in terms of filter pushdown. I'm not very familiar with Tableau myself, but they have a help page for BigQuery that talks about switching between legacy and standard SQL.

answered Dec 19 '16 at 18:47

Elliott Brossard

32,095
2
67
99

This is interesting, after changing the view to support standard SQL the number of bytes scanned has gone down to 314MB, but it does take longer to run slightly (not sure why). Why does partitioning by date work differently in legacy SQL from standard SQL. I tried going through their docs but cannot get enough details. – opensourcegeek Dec 20 '16 at 13:25
Does this mean in order to use view with partition date, I'd have to use legacy SQL? – opensourcegeek Dec 21 '16 at 13:37
Standard SQL supports partitioned tables. Where did you see otherwise? – Elliott Brossard Dec 21 '16 at 14:25

score 0 · Answer 2 · answered Dec 20 '16 at 13:56

0

Just guess - the problem you see is because you have repeated fields. Legacy and Standard SQL deal differently with flattening result. Legacy SQL does flatten result thus you see not count of original records but rather number of repeated values in them. Whereas Standard SQL keep original structure. In Legacy SQL you need to take extra care of eliminating effect of Flattening, while in Standard SQL it is already taken care of

answered Dec 20 '16 at 13:56

Mikhail Berlyant

165,386
8
154
230

Thanks - do I have to use standard SQL to be able to use partition date effectively in my queries? – opensourcegeek Dec 20 '16 at 15:06
as I tried to make a point in my answer - the issue you see is most likely not really because of partitioned table! Both Legacy and Standard SQL support partitioned tables. Standard SQL will help you because it is more effectively deals with ARRAYS (repeated fields). – Mikhail Berlyant Dec 20 '16 at 15:50
Ok - thanks, but I still can't get the queries against my queries to scan less data when I use legacy SQL against view. In fact it doesn't look like it's using partition date at all when using view. – opensourcegeek Dec 20 '16 at 19:42
agree. this part might be related to the partition used in view. the problem in answering your question is that it is too broad. I would suggest to separate different issues into different questions and post them separately. this way you have more/better chances to get them addressed. so far - for the current question - that's all I can help you with – Mikhail Berlyant Dec 20 '16 at 19:48

Why Google BigQuery doesn't use partition date correctly when using views

2 Answers2