We find that when we run queries with UDFs on large inputs, they very often fail with an "internal error". One idea for reducing the failure rate is to split the table before running the query. But the results are still intermittent -- sometimes the query succeeds and sometimes it fails (the exact same query on the exact same input).
So the question is: is it generally more reliable and/or faster to run this query:
SELECT field1, field2
FROM (
  SELECT field1, field2 FROM some_udf(
    SELECT field1, field2 FROM table_with_300_MM_rows
    WHERE hash(some_unique_key) % {n} = {table_id_1}
  )
),
....
(
  SELECT field1, field2 FROM some_udf(
    SELECT field1, field2 FROM table_with_300_MM_rows
    WHERE hash(some_unique_key) % {n} = {table_id_n}
  )
)
Instead of this?
SELECT field1, field2 FROM some_udf(
  SELECT field1, field2 FROM table_with_300_MM_rows)
And if so, for what value of n (i.e., into how many sub-tables should we split for optimal performance)? Our understanding is that this should not happen: the failure is probably a UDF error, and if the UDF succeeds on each of the split-up inputs individually, there is no reason it should not succeed on the whole input.
Assume that the query is such that both methodologies above result in the same exact output.
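For what it's worth, rather than hand-writing the n sub-queries, the sharded form can be generated programmatically. Below is a sketch (not from the original post) that assumes the legacy BigQuery convention where a comma between parenthesized subqueries behaves as UNION ALL; the table, column, and UDF names simply mirror the question:

```python
# Sketch: build the n-way sharded query as a string, one subquery per
# shard, each filtering on hash(some_unique_key) % n == shard id.
# Since hash(...) % n takes exactly one value in 0..n-1 per row, the
# shards are disjoint and together cover every row of the table.
SHARD_TEMPLATE = """(
  SELECT field1, field2 FROM some_udf(
    SELECT field1, field2 FROM table_with_300_MM_rows
    WHERE hash(some_unique_key) % {n} = {shard_id}
  )
)"""

def build_sharded_query(n: int) -> str:
    # Comma-join the parenthesized subqueries (legacy-SQL UNION ALL).
    shards = ",\n".join(
        SHARD_TEMPLATE.format(n=n, shard_id=i) for i in range(n)
    )
    return "SELECT field1, field2\nFROM\n" + shards

print(build_sharded_query(4))
```

Generating the query this way also makes it easy to experiment with different values of n to find where reliability stops improving.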