Splitting a table in BigQuery for faster result

Question

We find that when we run queries with UDFs on large inputs they tend to fail very often with an "internal error". One thought for making the situation less frequent is to split the table before running the query. But, the results are still intermittent -- sometimes the query succeeds, sometimes it fails (the exact same query on the exact same input).

So the question is, is it generally more reliable and/or faster to run this query:

SELECT field1, field2 FROM (SELECT field1, field2 FROM some_udf( SELECT field1, field2 FROM table_with_300_MM_rows WHERE hash(some_unique_key) % {n} = {table_id_1}) ), .... ( SELECT field1, field2 FROM some_udf( SELECT field1, field2 FROM table_with_300_MM_rows WHERE hash(some_unique_key) % {n} = {table_id_n}) ),

Instead of this?

SELECT field1, field2 FROM some_udf( SELECT field1, field2 FROM table_with_300_MM_rows)

And if so, for what value of n? (how many sub-tables do we have to split for optimal performance)? Our understanding is that this should not happen, as it is probably related to a UDF error and if a UDF succeeds on (individually) all the inputs that are split up there is no reason it should not succeed in the whole input.

Assume that the query is such that both methodologies above result in the same exact output.

Hi, this is the same issue you reported in http://stackoverflow.com/questions/37216672/bigquery-request-timed-out-error-on-script. Please see my comment there. — thomaspark, May 17 '16 at 17:02

score 1 · Answer 1 · answered May 16 '16 at 05:43

You certainly shouldn't get an internal error. Any time that happens, it's our fault--at the very least, we should plumb a proper error message through.

Also, it's fair to expect any query that works for a small amount of data to continue to work as the table grows in size. We're in charge of scaling and partitioning, so you don't have to worry about it.

We have seen a few similar UDF scaling issues recently and will be investigating them soon. Hopefully we'll fix the issue on our end, and you won't have to worry about this anymore.

That said, if splitting the table helps get the query to go through for the time being, go for it. The downside is that you'll get charged twice for two scans of the table, so if this is a common operation, you might want to split your table into two permanent tables, and then run separate queries (or subqueries) over those smaller tables.

Splitting a table in BigQuery for faster result

1 Answers1