5

I have a Jupyter notebook accessing BigQuery using pandas as the vehicle:

df = pd.io.gbq.read_gbq(query, project_id='xxxxxxx-xxxx')

This works fine from my local machine (great, in fact!), but when I load the same notebook into Cloud Datalab I get:

DistributionNotFound: google-api-python-client

Which is rather disappointing! I believe the module should be installed with pandas... but somehow Google is not including it? For a bunch of reasons it would be most preferable not to have to change the code from what we develop on our local machines to what is needed in Cloud Datalab; in this case we heavily parameterize the data access.
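For example, something like this thin wrapper (illustrative only; run_query and PROJECT_ID are made-up names, not our actual code) is what keeps our calling code identical in both environments:

import pandas as pd

PROJECT_ID = 'xxxxxxx-xxxx'  # placeholder project id, as above

def run_query(query, project_id=PROJECT_ID):
    # single entry point so local and Datalab notebooks share the same
    # data-access code; only the auth setup should differ per environment
    return pd.io.gbq.read_gbq(query, project_id=project_id)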

Ok I ran:

!pip install --upgrade google-api-python-client

Now when I run the notebook I get an auth prompt that I cannot resolve since DataLab is on a remote machine:

Your browser has been opened to visit:
 >>> Browser string>>>>
If your browser is on a different machine then exit and re-run this
application with the command-line parameter 

 --noauth_local_webserver

I don't see an obvious answer to this.
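From reading the oauth2client docs, something like this might sidestep the browser prompt (a sketch I have not verified; it assumes oauth2client's tools.run_flow honors the --noauth_local_webserver flag when passed programmatically, and the client id and secret are placeholders):

from oauth2client import tools
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage

# parse the flag programmatically, since there is no command line in a notebook
flags = tools.argparser.parse_args(['--noauth_local_webserver'])

flow = OAuth2WebServerFlow(client_id='<Client ID>',
                           client_secret='<Client secret>',
                           scope='https://www.googleapis.com/auth/bigquery')
storage = Storage('bigquery_credentials.dat')

# prints a URL to open on another machine and asks for the verification code here
credentials = tools.run_flow(flow, storage, flags)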

I used the code suggested below by @Anthonios Partheniou from within the same notebook (executing it in a cell) after updating google-api-python-client, and got the following traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-3-038366843e56> in <module>()
  5                            scope='https://www.googleapis.com/auth/bigquery',
  6                            redirect_uri='urn:ietf:wg:oauth:2.0:oob')
----> 7 storage = Storage('bigquery_credentials.dat')
  8 authorize_url = flow.step1_get_authorize_url()
  9 print 'Go to the following link in your browser: ' + authorize_url

/usr/local/lib/python2.7/dist-packages/oauth2client/file.pyc in __init__(self, filename)
 37 
 38     def __init__(self, filename):
---> 39         super(Storage, self).__init__(lock=threading.Lock())
 40         self._filename = filename
 41 

 TypeError: object.__init__() takes no parameters

He mentions needing to execute the notebook from the same folder, yet the only way I know of to execute a Datalab notebook is via the repo?
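One thing worth checking (a minimal sanity check; the mismatch theory is my guess, not confirmed) is whether the kernel is picking up a stale or partially upgraded oauth2client, since the traceback above looks like file.py and client.py disagree about the Storage base class:

import oauth2client

# confirm which oauth2client the Datalab kernel actually imports
print oauth2client.__version__
print oauth2client.__file__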

While the new Jupyter Datalab module is a possible alternative, the ability to use the full pandas BQ interface unchanged on both local and Datalab instances would be hugely helpful, so I'm crossing my fingers for a solution!

pip installed:
GCPDataLab 0.1.0
GCPData 0.1.0
wheel 0.29.0
tensorflow 0.6.0
protobuf 3.0.0a3
oauth2client 1.4.12
futures 3.0.3
pexpect 4.0.1
terminado 0.6
pyasn1 0.1.9
jsonschema 2.5.1
mistune 0.7.2
statsmodels 0.6.1
path.py 8.1.2
ipython 4.1.2
nose 1.3.7
MarkupSafe 0.23
py-dateutil 2.2
pyparsing 2.1.1
pickleshare 0.6
pandas 0.18.0
singledispatch 3.4.0.3
PyYAML 3.11
nbformat 4.0.1
certifi 2016.2.28
notebook 4.0.2
cycler 0.10.0
scipy 0.17.0
ipython-genutils 0.1.0
pyasn1-modules 0.0.8
functools32 3.2.3-2
ipykernel 4.3.1
pandocfilters 1.2.4
decorator 4.0.9
jupyter-core 4.1.0
rsa 3.4.2
mock 1.3.0
httplib2 0.9.2
pytz 2016.3
sympy 0.7.6
numpy 1.11.0
seaborn 0.6.0
pbr 1.8.1
backports.ssl-match-hostname 3.5.0.1
ggplot 0.6.5
simplegeneric 0.8.1
ptyprocess 0.5.1
funcsigs 0.4
scikit-learn 0.16.1
traitlets 4.2.1
jupyter-client 4.2.2
nbconvert 4.1.0
matplotlib 1.5.1
patsy 0.4.1
tornado 4.3
python-dateutil 2.5.2
Jinja2 2.8
backports-abc 0.4
brewer2mpl 1.4.1
Pygments 2.1.3

end

dartdog
  • I couldn't reproduce the exception. Can you please run the following commands and reply with the oauth2client version? Run `import oauth2client` followed by `oauth2client.__version__` – Anthonios Partheniou Jun 15 '16 at 19:30
  • Also run `import pip` followed by `for dist in pip.get_installed_distributions(): print dist` – Anthonios Partheniou Jun 15 '16 at 19:35
  • Running `import oauth2client` followed by `oauth2client.__version__` gives `'1.4.12'` – dartdog Jun 15 '16 at 20:25
  • Pip Installed list above – dartdog Jun 15 '16 at 20:32
  • I deployed a clean instance of datalab using https://datalab.cloud.google.com/ and the code I provided in my answer worked. One thing I noticed is that in my list of installed python modules I see `google-api-python-client 1.5.1` , but I didn't see `google-api-python-client` in the output that you provided. – Anthonios Partheniou Jun 16 '16 at 00:57
  • Strange... I'll try a stop and redeploy in the AM. I have not attempted any mods, so I'm not sure how I'm out of sync; maybe just because I have not restarted in a while? Would it (google-api-python-client) be something I need to install on Datalab? – dartdog Jun 16 '16 at 01:00
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/114794/discussion-between-anthonios-partheniou-and-dartdog). – Anthonios Partheniou Jun 16 '16 at 01:02

1 Answer

7

Google BigQuery authentication in pandas is normally straightforward, except when pandas code is executed on a remote server, for example when running pandas on Datalab in the cloud. In that case, use the following code to create the credentials file that pandas needs to access Google BigQuery in Google Datalab.

from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage

# out-of-band (oob) flow: no local webserver is needed, which suits a
# remote Datalab instance
flow = OAuth2WebServerFlow(client_id='<Client ID from Google API Console>',
                           client_secret='<Client secret from Google API Console>',
                           scope='https://www.googleapis.com/auth/bigquery',
                           redirect_uri='urn:ietf:wg:oauth:2.0:oob')

# pandas looks for this file when authenticating to BigQuery
storage = Storage('bigquery_credentials.dat')

authorize_url = flow.step1_get_authorize_url()
print 'Go to the following link in your browser: ' + authorize_url
code = raw_input('Enter verification code: ')

# exchange the pasted code for credentials and persist them to disk
credentials = flow.step2_exchange(code)
storage.put(credentials)

Once you complete the process I don't expect you will see the error (as long as the notebook is in the same folder as the newly created 'bigquery_credentials.dat' file).
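As a quick check once the flow completes (a minimal sketch; the project id is the placeholder from the question, and this assumes pandas 0.18's gbq module looks for the file in the current working directory, as described above):

import os
import pandas as pd

# the credentials file should sit next to the notebook
print os.path.exists('bigquery_credentials.dat')

# pandas picks the stored credentials up automatically
df = pd.io.gbq.read_gbq('SELECT 1 AS one', project_id='xxxxxxx-xxxx')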

You also need to install the google-api-python-client python package as it is required by pandas for Google BigQuery support. You can run either of the following in a notebook to install it.

Either

!pip install google-api-python-client --no-deps
!pip install uritemplate --no-deps
!pip install simplejson --no-deps

or

%%bash
pip install google-api-python-client --no-deps
pip install uritemplate --no-deps
pip install simplejson --no-deps

The --no-deps option is needed so that you don't accidentally upgrade a python package that is installed in Datalab by default (which could break other parts of Datalab).
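To confirm the install took effect without pulling anything else in, a quick version check (a minimal sketch):

import pkg_resources

# report the installed version without upgrading any package
print pkg_resources.get_distribution('google-api-python-client').version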

Note: With pandas 0.19.0 (not released yet), it will be much easier to use pandas in Google Cloud Datalab. See Pull Request #13608

Note: You also have the option to use the (new) google datalab module inside of jupyter (and that way the code will also work in Google Datalab on the cloud). See the following related stack overflow answer: How do I use gcp package from outside of google datalabs?
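For comparison, here is roughly what the Datalab-native route looks like (a sketch only; it assumes the gcp.bigquery module that shipped with Datalab at the time, and API names may differ between releases):

import gcp.bigquery as bq

# run the same SQL through Datalab's own BigQuery client and get a DataFrame
df = bq.Query('SELECT 1 AS one').to_dataframe()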

Anthonios Partheniou
  • Just tried it with no luck, updated post at the bottom with traceback Any help appreciated! – dartdog Jun 15 '16 at 18:50
  • Closer! I completely restarted the instance and ran the pip installs. I then got the prompt for the verification code, but the link gave me a 404: invalid_client, "The OAuth client was not found." Request details: access_type=offline scope=https://www.googleapis.com/auth/bigquery response_type=code redirect_uri=urn:ietf:wg:oauth:2.0:oob client_id= I have noticed I get some issues on parts of the cloud console like the repo and BQ and have to use incognito, as it won't honor multiple users on Chrome? I did try incognito to open this link as well, but no luck. – dartdog Jun 16 '16 at 15:53
  • Please replace `<Client ID from Google API Console>` and `<Client secret from Google API Console>` with the client Id and secret that you get when you create credentials in the BigQuery management project – Anthonios Partheniou Jun 16 '16 at 16:17
  • I think it's under the permissions menu after you click on the BigQuery heading in Google Cloud Console – Anthonios Partheniou Jun 16 '16 at 16:23
  • found it under api's – dartdog Jun 16 '16 at 16:35
  • Way further! First it said I needed to create a secret for a native app, not a web app; I couldn't find that, so I selected "other", got a code and entered it, but got this traceback: /usr/local/lib/python2.7/dist-packages/oauth2client/client.pyc in step2_exchange(self, code, http, device_flow_info) ... error_msg = 'Invalid response: %s.' % str(resp.status) -> raise FlowExchangeError(error_msg) ... FlowExchangeError: invalid_grant – dartdog Jun 16 '16 at 16:46
  • Yes, I selected `other` as well. Create an OAuth client ID. You'll need to enter both ClientID and secret. I found a good resource here. https://developers.google.com/identity/sign-in/web/devconsole-project – Anthonios Partheniou Jun 16 '16 at 16:50
  • Second try the charm! It is working! Thanks so much!! Whew, need to add the bit about how to create the API credentials! – dartdog Jun 16 '16 at 17:01
  • Also rather strange, while it is working I do not see the bigquery_credentials.dat file in any of my visible storage buckets? Any hints? – dartdog Jun 16 '16 at 19:23
  • The file `bigquery_credentials.dat` should be generated in the same folder as the notebook that created the credentials – Anthonios Partheniou Jun 16 '16 at 19:30
  • But as I understand it, the notebook is transitory unless committed to the repo, so maybe the storage is not accessible via a user interface? So there is some sort of working file someplace? I see it in the notebook interface but not in a storage bucket, so could I commit it (a bad idea, I'm sure!)? – dartdog Jun 16 '16 at 19:31
  • You can commit the notebook without the client Id and secret. It is quick to generate the bigquery_credentials.dat file once you have the notebook and client Id/secret handy – Anthonios Partheniou Jun 16 '16 at 19:37
  • My curiosity is piqued a bit more as to the use of the temporary (I guess) storage for transitory working files (like a report generated out of a notebook) – dartdog Jun 16 '16 at 19:39