I'm trying to read an object with a Spark job locally, an object that I previously created with another Spark job, also run locally. Looking at the logs I see nothing weird, and in the Spark UI the job is just stuck:

(screenshot of the Spark UI showing the job stuck in the running stage)

Before I kick off the read job, I update the Spark config as follows:

val hc = spark.sparkContext.hadoopConfiguration
// Register the GCS connector implementations for the gs:// scheme
hc.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hc.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
// Authenticate with the service account's credentials
hc.set("fs.gs.project.id", credential.projectId)
hc.set("fs.gs.auth.service.account.enable", "true")
hc.set("fs.gs.auth.service.account.email", credential.email)
hc.set("fs.gs.auth.service.account.private.key.id", credential.keyId)
hc.set("fs.gs.auth.service.account.private.key", credential.key)

Then I simply read it like this:

val path = "gs://mybucket/data.csv"
val options = Map("credentials" -> credential.base64ServiceAccount, "parentProject" -> credential.projectId)
spark.read.format("csv")
      .options(options)
      .load(path)
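
A side note, not the cause of the hang: the file was written with "header" -> "true", so the read would normally set the same option, otherwise the header row comes back as a data row. A minimal sketch (the header option is standard Spark CSV; the rest is unchanged):

val df = spark.read.format("csv")
  .options(options)
  .option("header", "true") // parse the first row as column names, matching the write
  .load(path)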

My service account has the following roles; I literally added every role I could find for object storage:

Storage Admin
Storage Object Admin
Storage Object Creator
Storage Object Viewer

This is how I previously wrote the object:

val path = "gs://mybucket/data.csv"
val options = Map("credentials" -> credential.base64ServiceAccount, "parentProject" -> credential.projectId, "header" -> "true")
val writer = df.write.format("csv").options(options)
writer.save(path)

These are my dependencies:

Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.hadoop" % "hadoop-client" % "3.3.1",
  "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.23.0",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.4",
  "com.google.cloud" % "google-cloud-storage" % "2.2.1"
)

Any idea why the write would succeed but the read would get stuck like this?

  • Hi, can you check if this [stackoverflow answer](https://stackoverflow.com/a/31665921/15774177) helps? – Zeenath S N Dec 09 '21 at 12:23
  • @ZeenathSN I don't see any exception, though the job is stuck in the running stage and never moves to failed. I see in the logs that the file scan was reading from the right file path, but nothing after that, so I can't tell where the issue is! – bachr Dec 09 '21 at 18:16
  • @ZeenathSN The read works now, after I updated gcs-connector and spark-bigquery-with-dependencies to the latest versions – bachr Dec 09 '21 at 21:32
  • 1
  • If it is working now, then can you provide it as an answer with some explanation of what you did, so that it helps the community? – Zeenath S N Dec 13 '21 at 08:15

1 Answer


I was using outdated versions of the dependencies. Once I updated the Google connector dependencies to the latest versions available (December 2021), the read from Google Storage started working, just like the write.
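
For reference, the updated dependency block looks roughly like this; the version strings for the Google artifacts are placeholders, since "latest" depends on when you resolve them, so check Maven Central for the actual releases:

Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.hadoop" % "hadoop-client" % "3.3.1",
  // bump the Google connector artifacts to their latest releases;
  // the version strings below are placeholders, not exact releases
  "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "<latest>",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-<latest>",
  "com.google.cloud" % "google-cloud-storage" % "<latest>"
)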
