
I'm trying to connect Hadoop running on Google Cloud VM to Google Cloud Storage. I have:

  • Modified core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties
  • Downloaded the gcs-connector-latest-hadoop2.jar and referenced it in the generated hadoop-env.sh (see the sketch after this list)
  • Authenticated via gcloud auth login using my personal account (rather than a service account)
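For reference, the core-site.xml entries look like this (the class names are the connector's standard implementation classes; the classpath line uses a placeholder for my actual jar path):

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>

And in hadoop-env.sh:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/gcs-connector-latest-hadoop2.jar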

I'm able to run gsutil ls gs://mybucket/ without any issues, but when I execute

hadoop fs -ls gs://mybucket/

I get the output:

14/09/30 23:29:31 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2 

ls: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token

What steps am I missing to get Hadoop to see Google Cloud Storage?

Thanks!

Denny Lee

3 Answers


By default, the gcs-connector running on Google Compute Engine is optimized for the built-in service-account mechanisms, so a few extra configuration keys must be set to force it through the oauth2 flow. You can borrow the same "client_id" and "client_secret" from gcloud auth as follows, adding them to your core-site.xml and disabling fs.gs.auth.service.account.enable:

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>false</value>
</property>
<property>
  <name>fs.gs.auth.client.id</name>
  <value>32555940559.apps.googleusercontent.com</value>
</property>
<property>
  <name>fs.gs.auth.client.secret</name>
  <value>ZmssLNjJy2998hD4CTg2ejr2</value>
</property>

You can optionally also set fs.gs.auth.client.file to something other than its default of ~/.credentials/storage.json.
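For example (the path here is just an illustration):

<property>
  <name>fs.gs.auth.client.file</name>
  <value>/home/myuser/.credentials/storage.json</value>
</property>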

If you do this, then when you run hadoop fs -ls gs://mybucket you'll see a new prompt, similar to the "gcloud auth login" prompt, where you visit a browser and enter a verification code again. Unfortunately, the connector can't quite consume a "gcloud"-generated credential directly, even though it can possibly share a credential-store file, because it asks explicitly for the GCS scopes it needs (you'll notice the new auth flow asks only for GCS scopes, rather than a big list of services like "gcloud auth login" does).

Make sure you've also set fs.gs.project.id in your core-site.xml:

<property>
  <name>fs.gs.project.id</name>
  <value>your-project-id</value>
</property>

since the GCS connector likewise doesn't automatically infer a default project from the related gcloud auth.

Dennis Huo
  • Thanks for the information Dennis! As the instance was created using my own gmail account, how would I determine what my client ID and secret key would be? I tried using my gmail address and the verification code generated by "gcloud auth login" but it's giving me a different error message: ls: No FileSystem for scheme: gs – Denny Lee Oct 01 '14 at 04:19
  • So, client id and client secret are actually not attributed to a gmail account, but are rather attached to a *project*; in this case, the "installed app" flow means the "client secret" is a bit of a misnomer. The literal 32555940559.apps.googleusercontent.com/ZmssLNjJy2998hD4CTg2ejr2 I provided are attributed to a Google-managed project associated with the Cloud SDK, which is why the auth flow mentions "Google Cloud SDK wants to access...". It doesn't get involved in actual access control or billing, so using those values as-is suits most practical purposes. – Dennis Huo Oct 01 '14 at 18:59
  • In order to use a client id and client secret specific to your project, you would go to cloud.google.com/console under the project you're using, find "APIs & auth" -> "Credentials", find a box that says "Client ID for native application" or if it doesn't exist, click the "Create new Client ID" button with "Installed Application" as the type, and then use the provided client_id and client_secret there. – Dennis Huo Oct 01 '14 at 19:01
  • Thanks very much Dennis! Very helpful context! – Denny Lee Oct 02 '14 at 04:09
  • I have made the changes described above but I am still getting the error below. py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions. : java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token . How do I get rid of this? – Ravi Ranjan Sep 13 '17 at 07:02

Thanks very much for both of your answers! They led me to the configuration noted in Migrating 50TB data from local Hadoop cluster to Google Cloud Storage.

I was able to use fs.gs.auth.service.account.keyfile by generating a new service account and then applying the service account's email address and p12 key, roughly as sketched below.
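For anyone hitting the same issue, the relevant core-site.xml entries look roughly like this (the email and keyfile path are placeholders for the real values from the generated service account):

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>your-service-account@developer.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/your-key.p12</value>
</property>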

Denny Lee
    Please don't add "thank you" as an answer. Instead, vote up the answers that you find helpful. – Trikaldarshiii Oct 01 '14 at 10:01
  • Please re-read my answer before deleting it. While I thanked both people for providing their helpful answers, my answer is different from theirs. I had also up voted both responses as their answers had led me to mine. – Denny Lee Oct 01 '14 at 16:02

It looks like the instance itself isn't configured to use the correct service account (though the gsutil command-line utility is), and the Hadoop filesystem adapter isn't picking up those credentials.

First, check whether the instance is configured with the correct service account; one quick check is sketched below. If it isn't, you can set one up.
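For instance, one quick way to check from the VM itself is to query the Compute Engine metadata server directly (these are the standard metadata endpoints; the accounts and scopes returned will vary by instance):

# List the service accounts attached to this instance
curl -H "Metadata-Flavor: Google" \
  "http://metadata/computeMetadata/v1/instance/service-accounts/"

# Show the OAuth scopes granted to the default service account
curl -H "Metadata-Flavor: Google" \
  "http://metadata/computeMetadata/v1/instance/service-accounts/default/scopes"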

Hope this helps!

ssk2
  • Thanks - will definitely check that! – Denny Lee Oct 01 '14 at 03:40
  • Oh, the issue I'm running into is that I spun up the instances via the mesosphere config, which was using my own gmail address. Is there a way to apply a service account to instances that have already been created? (Looking at the documentation, it appears that I can only apply the service account at instance creation.) – Denny Lee Oct 01 '14 at 04:17
  • I think using that second link you should be able to apply (or configure) a service account on instances after they've been created. – ssk2 Oct 01 '14 at 14:40