
I am trying to ingest data from GCS in account A into BigQuery in account B, using Spark running on Dataproc in account B.

I have tried setting GOOGLE_APPLICATION_CREDENTIALS to a service account key file which allows access to the necessary bucket in account A, but when I start spark-shell I get the following error:

Exception in thread "main" java.io.IOException: Error accessing Bucket dataproc-40222d04-2c40-42f9-a5de-413a123f949d-asia-south1

As I understand it, setting the environment variable switches the access from account B to account A.

Is there a way to have both accesses within Spark, i.e. default access to account B and additional access to account A?

Update: I tried running spark-shell with the configuration from Igor's answer, but the error persists. Here's the command I tried and the stack trace.

$ spark-shell --conf spark.hadoop.fs.gs.auth.service.account.json.keyfile=/home/shasank/watchful-origin-299914-fa29998bad08.json --jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
Exception in thread "main" java.io.IOException: Error accessing Bucket dataproc-40999d04-2b99-99f9-a5de-999ad23f949d-asia-south1
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getBucket(GoogleCloudStorageImpl.java:1895)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1846)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfoInternal(GoogleCloudStorageFileSystem.java:1125)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1116)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.exists(GoogleCloudStorageFileSystem.java:440)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configureBuckets(GoogleHadoopFileSystemBase.java:1738)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.configureBuckets(GoogleHadoopFileSystem.java:76)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1659)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:683)
  at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:646)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3242)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
  at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:165)
  at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:146)
  at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:144)
  at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$3.apply(SparkSubmit.scala:403)
  at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$3.apply(SparkSubmit.scala:403)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(SparkSubmit.scala:403)
  at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:250)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by:
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException:
  403 Forbidden {
    "code" : 403,
    "errors" : [ {
      "domain" : "global",
      "message" : "ingestor@watchful-origin-299914.iam.gserviceaccount.com does not have storage.buckets.get access to dataproc-40999d04-2b99-99f9-a5de-999ad23f949d-asia-south1.",
      "reason" : "forbidden" } ],
    "message" : "ingestor@watchful-origin-299914.iam.gserviceaccount.com does not have storage.buckets.get access to  dataproc-40999d04-2b99-99f9-a5de-999ad23f949d-asia-south1." }
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:401)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
  at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getBucket(GoogleCloudStorageImpl.java:1889)
  ... 32 more
Shasankar
  • Did you authorize your service account to access your bucket? Can you be precise when you speak about "account"? Do you mean a project or a service account? – guillaume blaquiere Aug 11 '19 at 05:02
  • Yes the service account is authorized to access the bucket. By "account" I mean "GCP account". More specifically account A is my customer's GCP account and account B is my GCP account where I am building a data lake. – Shasankar Aug 11 '19 at 10:38
  • Which role? Object reader or bucket admin? – guillaume blaquiere Aug 11 '19 at 10:42
  • Storage Object Viewer role. – Shasankar Aug 11 '19 at 10:43
  • Hmm, try Storage Admin. My guess is the following: sometimes libraries perform a bucket.list API call before getting the object, I don't know why, and it differs depending on the language. If you want, you can create a custom role with only the bucket.list and bucket.get permissions to reduce privileges. – guillaume blaquiere Aug 11 '19 at 10:48
  • I found an issue, seems to be related https://github.com/GoogleCloudPlatform/bigdata-interop/issues/135 – Dagang Aug 11 '19 at 17:29
  • How about your GCE VMs' scopes? Sometimes roles aren't enough to interact with some components. Make sure that the Storage and BigQuery scopes are enabled in your GCE VMs. – Kevin Quinzel Aug 12 '19 at 20:34
  • Thanks! Updated my answer with instructions on how to disable this bucket check. – Igor Dvorzhak Aug 15 '19 at 00:21

1 Answer


To achieve this you need to re-configure the GCS and BQ connectors to use different service accounts for authentication; by default, both of them use the GCE VM service account.

To do so, refer to Method 2 in the GCS connector configuration manual.

The same configuration applies to the Hadoop BQ connector, but you need to replace the fs.gs. prefix in the property names with the bq.mapred. prefix:

spark.hadoop.fs.gs.auth.service.account.json.keyfile=/path/to/local/gcs/key/file.json
spark.hadoop.bq.mapred.auth.service.account.json.keyfile=/path/to/local/bq/key/file.json
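
For example, these two properties can be passed directly on the spark-shell command line (the key file paths are placeholders for your own key files):

$ spark-shell \
    --conf spark.hadoop.fs.gs.auth.service.account.json.keyfile=/path/to/local/gcs/key/file.json \
    --conf spark.hadoop.bq.mapred.auth.service.account.json.keyfile=/path/to/local/bq/key/file.json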

Update:

To disable the Dataproc staging bucket check during GCS connector initialization, you need to use the latest GCS connector version (1.9.17 at the moment) and set the GCS connector system bucket property to an empty string:

spark.hadoop.fs.gs.system.bucket=

Note that this system bucket functionality is removed in the upcoming GCS connector 2.0, so this will not be an issue going forward.
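
For example, combining the key file setting with the empty system bucket property in one spark-shell invocation might look like this (the key file path is a placeholder):

$ spark-shell \
    --conf spark.hadoop.fs.gs.auth.service.account.json.keyfile=/path/to/accountA/key.json \
    --conf spark.hadoop.fs.gs.system.bucket= \
    --jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar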

Igor Dvorzhak
  • I tried this, but when I use the GCS key file for account A, Dataproc loses access to its own GCS bucket (which I believe it uses to store temp files). So it results in the same error I originally posted. I don't think it will allow me to specify 2 key files in spark.hadoop.fs.gs.auth.service.account.json.keyfile, but I will try that. – Shasankar Aug 12 '19 at 03:45
  • One service account with permissions in 2 projects is what you need. – Dagang Aug 12 '19 at 04:04
  • It is not 2 projects within a single GCP account; it is 2 different GCP accounts altogether. Is there a way to create such a service account? – Shasankar Aug 12 '19 at 04:10
  • You can try to specify the 2nd account during job submission instead of cluster creation. You can also set a different service account for the Dataproc cluster (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts). – Igor Dvorzhak Aug 12 '19 at 04:36
  • Also, setting `GOOGLE_APPLICATION_CREDENTIALS` could have unintended consequences, because it applies not only to the GCS and BQ connectors but to all Google API client libraries. – Igor Dvorzhak Aug 12 '19 at 04:39
  • I am creating the cluster without any Service Account, so it is using the default and thus it has access to the Dataproc staging bucket. But when I start spark-shell --conf spark.hadoop.fs.gs.auth.service.account.json.keyfile=/path/to/accountA/serviceAccount.json, it completely forgets about the default service account and thus complains of not having access to the staging bucket. – Shasankar Aug 12 '19 at 05:15
  • Could you post a full stack trace of when it complains about access to the staging bucket? – Igor Dvorzhak Aug 12 '19 at 17:56
  • @IgorDvorzhak The updated answer helps to start the `spark-shell` by bypassing the initial bucket check. But I run into the same problem while writing to BigQuery, since IndirectBigQueryOutputFormat requires a temporary write location on GCS, and for me this location has to be in account B (which is where the destination BigQuery dataset is). So I think I would still require a way to have simultaneous access to GCS buckets in 2 different GCP accounts. – Shasankar Aug 16 '19 at 05:37
  • In this case you need to create a single service account that has access to both the GCS and BQ resources that you want, because what you are trying to do is not supported at the moment. – Igor Dvorzhak Aug 16 '19 at 15:14
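
As a rough sketch of that last suggestion, the single service account could be granted roles in both projects and then attached to the Dataproc cluster. The project, bucket, and service account names below are placeholders, and the roles shown are only illustrative:

# Grant the shared service account (created in project B) read access to the bucket in account A
$ gsutil iam ch serviceAccount:shared-sa@project-b.iam.gserviceaccount.com:roles/storage.objectViewer gs://bucket-in-account-a

# Grant it BigQuery write access in project B
$ gcloud projects add-iam-policy-binding project-b \
    --member=serviceAccount:shared-sa@project-b.iam.gserviceaccount.com \
    --role=roles/bigquery.dataEditor

# Create the Dataproc cluster with that service account
$ gcloud dataproc clusters create my-cluster \
    --region=asia-south1 \
    --service-account=shared-sa@project-b.iam.gserviceaccount.com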