
Is it possible to get large datasets into a pandas DataFrame?

My dataset is approx. 1.5 GB uncompressed (input for clustering), but when I try to select the contents of the table using bq.Query(...) it throws an exception:

RequestException: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
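For reference, the failing pattern looks roughly like the following (a minimal sketch assuming the Datalab `gcp.bigquery` module; the dataset and table names are placeholders):

```python
import gcp.bigquery as bq

# Placeholder query over the ~1.5 GB table; this is the kind of call that
# raises RequestException once the result set exceeds the response size limit.
df = bq.Query('SELECT * FROM [my_dataset.my_table]').to_dataframe()
```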

Looking at https://cloud.google.com/bigquery/querying-data?hl=en, which states:

You must specify a destination table.

It feels like the only place to send large query results is another table (and then export to GCS via the web interface and download).

There will also be a (possibly large) write back as the classified rows are written back to the database.

The same dataset runs fine on my 16 GB laptop (a matter of minutes), but I am looking at migrating to Datalab as our data moves to the cloud.

Thank you very much, any help appreciated

Chris

1 Answer


If you already have your results in a Table, you can just use Table.to_dataframe().
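For example (a minimal sketch; the dataset and table names below are placeholders):

```python
import gcp.bigquery as bq

# Reference an existing BigQuery table and load it into pandas.
# Note: this pulls every row through the Python kernel, so it can be slow
# for millions of rows (see the comments below).
table = bq.Table('my_dataset.my_results_table')
df = table.to_dataframe()
```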

Otherwise, you will need to run a Query using execute(), with a destination table name specified (as you noted) and the allow_large_results=True parameter; following that you can make the to_dataframe() call as above.
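Something along these lines (a sketch only; the table names are placeholders and the exact parameter names may vary between Datalab releases):

```python
import gcp.bigquery as bq

query = bq.Query('SELECT * FROM [my_dataset.my_large_table]')

# Write the large result set to a destination table rather than returning it
# directly; allow_large_results requires a destination table to be named.
query.execute(table_name='my_dataset.my_results_table',
              allow_large_results=True)

# Once the job has completed, load the destination table as above.
df = bq.Table('my_dataset.my_results_table').to_dataframe()
```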

Note that you may have issues with this; the default VM that runs the Python kernel is pretty basic. In the meantime, you can deploy Datalab to a more capable VM using URL parameters; for example:

http://datalab.cloud.google.com?cpu=2&memorygb=16
Graham Wheeler
  • Thank you. What I found is that Table.to_dataframe() runs but doesn't seem to return, with the progress/loading bar on screen for a while. I compared this with manually exporting, via the BQ web interface, to a gzipped CSV on Google Cloud Storage, which loaded 4m+ rows into a pandas DataFrame in a matter of seconds (on a 16 GB instance as suggested). – Chris Jan 13 '16 at 15:32
  • Yes, Table.to_dataframe() is not going to be very fast; it has to read each row of the Table in Python, parse it and create a Python dictionary, and finally take the list of dictionaries and create a DataFrame, which is a lot of Python work on every row of the data. You can use Table.extract in Datalab to export to GCS, which is done without the Python kernel having to act as an intermediary, and then use the GCS object as you mentioned (a sketch of that workflow follows these comments). – Graham Wheeler Jan 13 '16 at 18:42
  • *Table.extract has worked.* I appreciate Datalab is in beta and have some feedback, if helpful: 1. I got exceptions with `job not found` when I had BigQuery in different projects (reading from project1:Dataset.Table to project2:gs://some-bucket), because Datalab is not available in the EU right now, so I needed a second project. Using one project for everything solved these exceptions. 2. The UI in Datalab still shows the "Running" box although the file is written in GCS and can be read in a different notebook. Thank you very much for your help. – Chris Jan 15 '16 at 06:40
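For anyone following the same route, the extract-then-read workflow discussed in these comments looks roughly like this (a sketch; the bucket and table names are placeholders, the extract parameters may differ between Datalab versions, and the `!gsutil` line uses notebook shell syntax):

```python
import pandas as pd
import gcp.bigquery as bq

# Ask BigQuery to export the results table to a gzipped CSV in GCS; the file
# is written directly by the service, so the Python kernel does not act as an
# intermediary for the row data.
table = bq.Table('my_dataset.my_results_table')
table.extract('gs://my-bucket/results.csv.gz', format='csv', compress=True)

# Copy the file locally and load it into pandas (this is the step that loaded
# 4m+ rows in seconds on the 16 GB instance, per the comments above).
!gsutil cp gs://my-bucket/results.csv.gz .
df = pd.read_csv('results.csv.gz', compression='gzip')
```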