
I tried two approaches to import a large table from Google BigQuery (about 50,000,000 rows, 18 GB) into a dataframe in Google Datalab, in order to do machine learning with TensorFlow.

First, I use (all needed modules are imported):

data = bq.Query('SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME`').execute().result().to_dataframe()

Then it just keeps running... forever. Even when I add LIMIT 1000000, nothing changes.

Second, I use:

data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect='standard', project_id='PROJECT_ID')

It runs well at first, but at around 450,000 rows (calculated from the percentage and the total row count) it gets stuck at:

Got page: 32; 45.0% done. Elapsed 293.1 s.

And I cannot find a way to enable allowLargeResults in read_gbq(). Following its documentation, I try:

data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect='standard', project_id='PROJECT_ID', configuration={'query': {'allowLargeResult': True}})

Then I get:

read_gbq() got an unexpected keyword argument 'configuration'

So I failed to import even 1,000,000 rows into Google Cloud Datalab, and I actually want to import 50 times that amount.

Any ideas?

Thanks

Zrisound

1 Answer


Before loading large datasets into Google Cloud Datalab, make sure to consider alternatives such as those mentioned in the comments on this answer: use sampled data for the initial analysis, determine the correct model for the problem, and then use a pipeline approach such as Google Cloud Dataflow to process the large dataset.
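For example, here is a minimal sketch of pulling only a small random sample into a dataframe for that initial analysis. The column list, table name, project ID and the 1% sampling rate are placeholders taken from the question; note that the query still scans the whole table, it just returns far fewer rows:

import pandas as pd

# Placeholder query based on the question; RAND() < 0.01 keeps roughly 1% of
# the rows, which is usually enough for exploration and model selection.
sample = pd.read_gbq(
    query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` WHERE RAND() < 0.01',
    dialect='standard',
    project_id='PROJECT_ID')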

There is an interesting discussion regarding Datalab performance improvements when downloading data from BigQuery to Datalab here. Based on those performance tests, a performance improvement was merged into Google Cloud Datalab in Pull Request #339. This improvement does not appear to be mentioned in the Datalab release notes, but I believe the fixes are included as of Datalab 1.1.20170406, so please make sure that you are running at least that version. To check the version, click the user icon in the top right corner of the navigation bar in Cloud Datalab, then click About Datalab.

Regarding the pandas.read_gbq() command that appears to be stuck, I would like to offer a few suggestions:

  1. Open a new issue in the pandas-gbq repository here.
  2. Try extracting data from BigQuery to Google Cloud Storage, for example in CSV format, which you can then load into a dataframe with pd.read_csv. Here are two methods to do this:

Using the Google BigQuery/Cloud Storage CLI tools: with the bq and gsutil command line tools, extract the data from BigQuery to Google Cloud Storage, and then download the object into Google Cloud Datalab. To do this, run bq extract <source_table> <destination_uris>, followed by gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [LOCAL_DESTINATION] to copy the exported file down.
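A minimal sketch of that flow from a Datalab notebook cell, assuming the bq and gsutil command line tools are available and authenticated in the environment; the dataset, table, bucket and file names are placeholders:

import subprocess

# Export the table to sharded CSV files in Cloud Storage; the * wildcard lets
# BigQuery split an export larger than 1 GB across multiple files.
subprocess.check_call(['bq', 'extract', '--destination_format=CSV',
                       'my_dataset.my_table', 'gs://my_bucket/my_table_*.csv'])

# Copy the exported files from Cloud Storage into the local Datalab filesystem.
subprocess.check_call(['gsutil', 'cp', 'gs://my_bucket/my_table_*.csv', '.'])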

Using Google Cloud Datalab

import google.datalab.bigquery as bq
import google.datalab.storage as storage

# Run the query and write its result as a CSV object in Cloud Storage.
bq.Query(<your query>).execute(output_options=bq.QueryOutput.file(path='gs://<your_bucket>/<object name>', use_cache=False)).result()

# Download the exported object from Cloud Storage into the Datalab environment.
result = storage.Bucket(<your_bucket>).object(<object name>).download()
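The downloaded content can then be loaded into a dataframe, for example as follows (a sketch assuming download() returns the object's raw CSV content as bytes):

import io
import pandas as pd

# `result` is assumed to hold the raw CSV bytes returned by download() above.
data = pd.read_csv(io.BytesIO(result))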

Regarding the error read_gbq() got an unexpected keyword argument 'configuration': the ability to pass arbitrary keyword arguments (configuration) was added in pandas version 0.20.0. I believe this error is caused by the fact that pandas is not up to date. You can check the installed version of pandas by running:

import pandas
pandas.__version__

To upgrade to version 0.20.0, run pip install --upgrade pandas pandas-gbq. This will also install pandas-gbq, which is an optional dependency for pandas.
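Once pandas is at 0.20.0 or later, the configuration keyword from the question should be accepted. Here is a sketch based on the question's own call; note that the BigQuery property is spelled allowLargeResults, and as far as I know it only applies to legacy SQL queries (it is ignored for standard SQL):

import pandas as pd  # requires pandas >= 0.20.0 with pandas-gbq installed

# The configuration dict is passed through to the BigQuery job configuration.
data = pd.read_gbq(
    query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000',
    dialect='standard',
    project_id='PROJECT_ID',
    configuration={'query': {'allowLargeResults': True}})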

Alternatively, you could try iterating over the table in Google Cloud Datalab. This works, but it is likely slower. This approach was mentioned in another Stack Overflow answer here: https://stackoverflow.com/a/43382995/5990514
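For completeness, here is one possible way to read the table in chunks with read_gbq (a sketch only, not necessarily what the linked answer does; the page size, total row count and <some_unique_column> are placeholders, and LIMIT/OFFSET paging re-runs the query for every page, so it is not cheap):

import pandas as pd

# The ORDER BY on a unique column is needed so that LIMIT/OFFSET pages do not
# overlap or skip rows between requests.
page_size = 200000
total_rows = 1000000
frames = []
for offset in range(0, total_rows, page_size):
    query = ('SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` '
             'ORDER BY <some_unique_column> '
             'LIMIT %d OFFSET %d' % (page_size, offset))
    frames.append(pd.read_gbq(query=query, dialect='standard',
                              project_id='PROJECT_ID'))
data = pd.concat(frames, ignore_index=True)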

I hope this helps! Please let me know if you have any issues so I can improve this answer.

Anthonios Partheniou

Contributor at Cloud Datalab

Project Maintainer at pandas-gbq

  • I wonder if it would be better to use Datalab more like a notebook for quick analyses and insightful exploration with sampled data rather than bringing 50 million rows in at once. Processing that amount of data would require some cluster structure, I imagine; I'm not sure Datalab would be able to handle it or scale up to process it (like working first with sampled data, finding the correct model for the problem and then using some cluster to process the whole data). – Willian Fuks Jul 03 '17 at 15:23
  • @Will I agree. This is definitely something that users should think about. Please feel free to submit your comment as a separate answer. Separately, if all rows are needed for analysis then potentially users can eliminate unnecessary columns to reduce the overall dataset size. – Anthonios Partheniou Jul 03 '17 at 15:31
  • I'd rather have you update your answer with this information :) It's already good enough; maybe it just needs to warn the OP about the approach of processing sampled data and then taking the models and conclusions to a cluster for further processing. – Willian Fuks Jul 03 '17 at 15:37
  • Updated. Thanks @Will ! – Anthonios Partheniou Jul 03 '17 at 15:50