I'm trying to download a table of about 250k rows and 500 columns from BigQuery into R for some model building in h2o using the R wrappers. It's about 1.1 GB when downloaded from BQ.
However, it runs for a long time and then loses the connection, so the data never makes it into R (I'm rerunning now so I can get a more precise example of the error).
I'm just wondering whether using bigrquery for this is a reasonable task, or if bigrquery is mainly intended for pulling smaller datasets from BigQuery into R.
Just wondering if anyone has any tips and tricks that might be useful - I'm going through the library code to try to figure out exactly how it does the download (and whether there's an option to write the file out locally or something), but I'm not entirely sure I know what I'm looking at.
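For context, the failing pull looks roughly like this - a minimal sketch using the bigrquery API, where the project/dataset/table names are placeholders:

library(bigrquery)

# placeholder project/dataset/table names
tbl_ref <- bq_table("my-project", "my_dataset", "model_data_final")

# this is the step that runs for a long time and then drops the connection
df <- bq_table_download(tbl_ref)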
Update:
I've gone with the quick fix of using the command-line tools (bq and gsutil) to download the data locally:
bq extract blahblah gs://blah/blahblah_*.csv
gsutil cp gs://blah/blahblah_*.csv /blah/data/
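If you want to keep the whole process inside an R script, the same two steps can be wrapped in system() calls - a sketch using the same placeholder table and bucket names as above:

# export the table to GCS (it gets sharded into multiple CSVs when large),
# then copy the shards down to local disk
system("bq extract blahblah gs://blah/blahblah_*.csv")
system("gsutil cp gs://blah/blahblah_*.csv /blah/data/")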
And then to read the data just use:
# get the file names in case the export was sharded across multiple files
file_names <- paste0('/blah/data/', list.files(path = '/blah/data/', pattern = paste0(my_lob, '_model_data_final')))
# read each file and stack them into one data frame
df <- do.call(rbind, lapply(file_names, read.csv))
It's actually a lot quicker this way - 250k rows is no problem.
I do find that BigQuery could do with a bit better integration into the wider ecosystem of tools out there. Love the R + Dataflow examples, definitely going to look into those a bit more.