
dplyr functions are typically highly performant, having been optimised by the open-source R community, with many running C++ under the hood to make them much faster.

Does the BigQuery SQL generated through bigrquery and dbplyr receive any optimisation, or is the SQL simply generated however it can be (unoptimised)? (Note that bigrquery and dbplyr, like dplyr, are tidyverse packages, and both are authored by the dplyr author Hadley Wickham.)

Background

I'm interested in how optimised the generated BigQuery code is because I am trying to decide whether it is worth further optimising some batch processes written with bigrquery and dbplyr by manually rewriting some of the BigQuery code (rather than using those packages). If I am unlikely to see large performance improvements, I will not dedicate the time to do so.

Example query

The following is based on the bigrquery README (with the connection setup included so the example is self-contained):

library(bigrquery)
library(dplyr)

con <- DBI::dbConnect(
  bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = "your-billing-project-id"  # replace with your own project ID
)

natality <- tbl(con, "natality")

natality %>%
  select(year, month, day, weight_pounds) %>%
  head(10) %>%
  collect()
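
As an aside, you can inspect the SQL that dbplyr generates for this pipeline with show_query() before anything is executed. The output below is indicative only; the exact text depends on your dbplyr version:

natality %>%
  select(year, month, day, weight_pounds) %>%
  head(10) %>%
  show_query()
#> <SQL>
#> SELECT `year`, `month`, `day`, `weight_pounds`
#> FROM `natality`
#> LIMIT 10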
stevec
  • Have you reviewed the auto-generated `bigrquery`/`sql` code that is generated by `dbplyr`? – Simon.S.A. Nov 11 '19 at 04:06
  • @Simon.S.A. Great idea. I have tried that. What I see is complex BigQuery code so I cannot tell if it’s optimised or not – stevec Nov 11 '19 at 04:32

1 Answer


This probably depends on what you are optimising for, e.g. run-time or pricing.

From my experience, focused on run-time:

  • I have had no performance (run-time) concerns using the default generated queries, though I have not tested this extensively.
  • dbplyr does include some basic query optimisation from version 1.0.0.
  • Human-written code is needed where humans will read the code; machine-generated code is fine where only a machine will read it.
  • I have noticed the greatest speed-up when saving intermediate tables to disc during especially complex manipulations (multiple joins or lag/lead), with a further speed-up from adding an index to the intermediate tables (run-time drops to a quarter or less); a sketch of this follows the list.
  • I have also reduced run-time by manually batching records: e.g. filter to all even ID numbers, run and save the analysis, repeat with all odd ID numbers, and append the results together (also sketched below). But this is probably due to my using an older version of SQL Server.
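
As a sketch of the intermediate-table point above: dbplyr's compute() materialises a query as a (temporary) table on the server, and takes an indexes argument on back-ends that support indexes. The connection, table, and column names here (con, orders, customers, customer_id, order_date, order_total) are hypothetical:

library(dplyr)
library(dbplyr)

# Materialise the join as an indexed temporary table, rather than running
# the joins and the window functions as one large query
joined <- tbl(con, "orders") %>%
  inner_join(tbl(con, "customers"), by = "customer_id") %>%
  compute(temporary = TRUE, indexes = list("customer_id"))

# The lag() then runs against the indexed intermediate table
result <- joined %>%
  group_by(customer_id) %>%
  window_order(order_date) %>%
  mutate(prev_total = lag(order_total)) %>%
  collect()

Note that indexes only applies on back-ends that support them (e.g. SQL Server); BigQuery has no conventional indexes, though materialising intermediate results can still help there.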

However, I am using dbplyr with SQL Server, not BigQuery, so you'll have to judge how well this generalises to your context.
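
A minimal sketch of the manual batching idea, again with hypothetical names (con, records, id, value); each batch is run and fetched separately, then the results are appended client-side:

library(dplyr)

# Run the same analysis once per batch (even then odd IDs), then stack
run_batch <- function(remainder) {
  tbl(con, "records") %>%
    filter(id %% 2L == remainder) %>%
    summarise(n = n(), total = sum(value, na.rm = TRUE)) %>%
    collect()
}

combined <- bind_rows(run_batch(0L), run_batch(1L))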

Simon.S.A.