
I'm trying to download a table of about 250k rows and 500 columns from BigQuery into R for some model building in h2o using the R wrappers. It's about 1.1 GB when downloaded from BQ.

However, it runs for a long time and then loses the connection, so the data never makes it to R (I'm rerunning now so I can get a more precise example of the error).

I'm just wondering whether using bigrquery for this is a reasonable task, or whether bigrquery is mainly for pulling smaller datasets from BigQuery into R.

Just wondering if anyone has any tips and tricks that might be useful. I'm going through the library code to try to figure out exactly how it's doing the download (I was going to see if there was an option to shard the file out locally or something), but I'm not entirely sure I even know what I'm looking at.
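For reference, the call I'm attempting looks roughly like the following (a minimal sketch, assuming the query_exec() interface from older bigrquery; the project, dataset and table names are placeholders):

# minimal sketch of the download I'm attempting; project/dataset/table names are placeholders
library(bigrquery)

project <- "my-gcp-project"
sql <- "SELECT * FROM [my_dataset.my_model_data_final]"

# max_pages = Inf keeps paging through all ~250k rows instead of stopping
# after the default number of pages; this is the call that runs for a long
# time and then drops the connection
df <- query_exec(sql, project = project, max_pages = Inf)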

Update:

I've gone with a quick fix of using the CLIs to download the data locally:

bq extract blahblah gs://blah/blahblah_*.csv

gsutil cp gs://blah/blahblah_*.csv /blah/data/
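(To keep it all in a single R script, those same two steps could be wrapped in system() calls - just a sketch, with the same placeholder names as above:)

# same two steps wrapped in system() calls so everything stays in one R script;
# table, bucket and path names are placeholders
system("bq extract blahblah gs://blah/blahblah_*.csv")
system("gsutil cp gs://blah/blahblah_*.csv /blah/data/")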

And then to read the data just use:

# get the file names, in case the export was sharded across multiple files
file_names <- paste(sep = '', '/blah/data/', list.files(path = '/blah/data/', pattern = paste(sep = '', my_lob, '_model_data_final')))

# read each file and stack them into one data frame
df <- do.call(rbind, lapply(file_names, read.csv))

It's actually a lot quicker this way - the 250k rows are no problem.
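If the read step ever becomes the slow part, data.table should handle files this size comfortably - a sketch, assuming the data.table package is installed:

# same read using data.table, assuming the package is installed
library(data.table)

# fread is generally faster than read.csv, and rbindlist stacks the shards in one go
df <- rbindlist(lapply(file_names, fread))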

I do find that BigQuery could do with a bit better integration into the wider ecosystem of tools out there. Love the R + Dataflow example, definitely going to look into that a bit more.

andrewm4894
  • It might be better to export the table in BigQuery to GCS, and then download the file(s) for input to your R model. – Graham Polley Jan 27 '17 at 01:01
  • Yeah, was going to do that as a fallback. Was hoping to keep it all in a single R script. I suppose I could use the bq and gcloud CLIs via system calls in R. – andrewm4894 Jan 27 '17 at 07:24
  • Is it possible to do some pre-aggregation in BQ beforehand? – Graham Polley Jan 27 '17 at 07:27
  • No, it's user-level data for a loyal-user model. Lots of dummy variables, as I figured it was better to push that work to BQ (the 250k is just a sample of many more users). Feels like R is potentially my bottleneck here, but I'm using an 8-core, 50 GB machine so was thinking it would be OK. – andrewm4894 Jan 27 '17 at 07:32
  • Another approach could be to use Renjin + Dataflow. https://medium.com/google-cloud/cloud-dataflow-can-autoscale-r-programs-for-massively-parallel-data-processing-492b57bd732d#.lgwbmz2cj – Graham Polley Jan 27 '17 at 07:48
  • 2
    BigQuery is good at *processing* data from large datasets. R isn't. Ideally you would plug R into BigQuery rather than trying to extract all the data so it can be used in memory with R. This way BigQuery gets to do the calculations and R is just about sending the requests to BigQuery and only summarised data has to be returned. This is how analytics tools like Tableau work (and Tableau can use R as a plug in to set up more sophisticated stats analysis). – matt_black Jan 27 '17 at 12:44
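As an illustration of that approach - aggregating in BigQuery and returning only the summary to R - here is a minimal sketch, again assuming the query_exec() interface; the table and column names are placeholders:

# aggregate in BigQuery and return only the small summary to R;
# table and column names are placeholders
library(bigrquery)

sql <- "
SELECT user_segment, COUNT(1) AS n_users, AVG(spend) AS avg_spend
FROM [my_dataset.my_model_data_final]
GROUP BY user_segment
"
summary_df <- query_exec(sql, project = "my-gcp-project")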

0 Answers