0

I have a set of sharded data similar to "Partitioning by date?" but some days do not have data and hence no days. An ideal solution would be to make all tables (dates) optional. Is that possible?

In detail: If you do a select from followed by a sequence [main.2013-08-01], [main.2013-08-02], ... every table must exist. If any of those tables do not exist, bigquery responds with a 500 error when attempting to execute the query. By using the term "optional" I am hoping for a syntax that would allow any particular table not to exist, but the query to still execute against the other tables in sum. An example of what I am hoping for: perhaps the query says SELECT * from ?[main.2013-08-01], ?[main.2013-08-02], ... but [main.2013-08-02] does not exist. Howevever, because it has the "?" indicator before the table it is considered optional and the query still executes successfully, and I get results, though only from [main.2013-08-01] and any other tables that actually do exist.

I am currently solving this problem by first doing a request for a dataset's table list and then comparing against the shards that I actually want to query for existence and merging the tables (e.g. I want all days in August, bigquery tells me that 7/27 doesn't exist, so my FROM block will contain a list of all data shards EXCEPT 7/27), but this method is complicated and involves two bigquery round trips, so it does not seem ideal.

Community
  • 1
  • 1
OverclockedTim
  • 1,713
  • 2
  • 12
  • 22

1 Answers1

0

As a workaround, have you tried having empty tables?

Let's say main.2013-08-01 exists, and main.2013-08-02 does not. You could:

SELECT * FROM [main.2013-08-01] WHERE any_column = 'non existing value'

That query will return 0 results, and you can select main.2013-08-02 as the destination table. This will create an empty table, with the same columns as main.2013-08-01. Then you can join without caring about emptiness!

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Thanks Fh, I suspect that would work as a workaround, but only if I knew in advance which tables were missing. In order to find which tables were missing, I would have to do something like my existing dataset table list query against all possible month-days and then add an additional step on top of that to execute the query that you suggest followed by my final query that I actually use to process the data. So in this particular case, this method is actually slower than the workaround I am currently using. – OverclockedTim Aug 28 '13 at 23:31