
As described in the blog post below,

https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview

I was trying to read a file from Google Cloud Storage using Spark (Scala). For that I imported the Google Cloud Storage connector and the Google Cloud Storage client library as below:

// https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage
compile group: 'com.google.cloud', name: 'google-cloud-storage', version: '0.7.0'

// https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector
compile group: 'com.google.cloud.bigdataoss', name: 'gcs-connector', version: '1.6.0-hadoop2'

After that, I created a simple Scala object like the one below (a SparkSession is created first):

val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
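For context, the SparkSession used above is created roughly as in this minimal sketch; the app name and the local[*] master are illustrative placeholders, since I am running from IntelliJ:

import org.apache.spark.sql.SparkSession

// Minimal local SparkSession; app name and master are placeholders for illustration
val spark = SparkSession.builder()
  .appName("GcsCsvRead")
  .master("local[*]")
  .getOrCreate()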

But it throws the error below:

17/03/01 20:16:02 INFO GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
17/03/01 20:16:23 WARN HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
    at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:205)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
    at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:317)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
    at test$.main(test.scala:41)
    at test.main(test.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

I have set up all the authentication as well, so I am not sure why this timeout is being thrown.

Edit

I am trying to run the above code through IntelliJ IDEA on Windows. The JAR file built from the same code works fine on Google Cloud Dataproc, but gives the above error when I run it on my local system. I have installed the Spark, Scala, and Google Cloud plugins in IntelliJ.

One more thing: I had created a Dataproc instance and tried to connect to its external IP address as described in the documentation, https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh

It was not able to connect to the server and gave a timeout error.

Shawn

3 Answers


Thank you Dennis for pointing me in the right direction. Since I am using Windows, there is no core-site.xml, as I do not have a Hadoop installation on Windows.

I downloaded a pre-built Spark distribution and configured the parameters you mentioned directly in the code, as described below.

I created a SparkSession and, through it, set the Hadoop configuration parameters, e.g. spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>"), along with all the other parameters that would otherwise go into core-site.xml.
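A rough sketch of the relevant configuration code (the keyfile path and project ID are placeholders; the fs.gs.impl entries may be unnecessary if a core-site.xml on the classpath already registers the connector):

// Point the gs:// scheme at the GCS connector and authenticate with a service-account JSON keyfile
val conf = spark.sparkContext.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>") // placeholder path
conf.set("fs.gs.project.id", "<project-id>")                                 // placeholder project id

val csvData = spark.read.csv("gs://my-bucket/project-data/csv")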

After setting all of these, the program could access the files in Google Cloud Storage.

Shawn

You need to set google.cloud.auth.service.account.json.keyfile to the local path of a JSON credential file for a service account you create following these instructions for generating a private key. The stack trace shows the connector thinks it's on a GCE VM and is trying to obtain a credential from a local metadata server. If that doesn't work, try setting fs.gs.auth.service.account.json.keyfile instead.
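If you're setting these programmatically rather than in a core-site.xml, a minimal sketch would look something like this (the path is a placeholder):

// Set both of the property names mentioned above; which one is honored can depend on the connector version
val conf = spark.sparkContext.hadoopConfiguration
conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile.json") // placeholder path
conf.set("fs.gs.auth.service.account.json.keyfile", "/path/to/keyfile.json")        // placeholder path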

When trying to SSH, have you tried gcloud compute ssh <instance name>? You may also need to check your Compute Engine firewall rules to make sure you're allowing inbound connections on port 22.

Dennis Huo
  • I have downloaded the JSON credential file for a service account key and set it in the environment variable GOOGLE_APPLICATION_CREDENTIALS (since I am using Windows), then tried to run the program, but I got the same timeout error. I hope I have correctly implemented your suggestion about pointing google.cloud.auth.service.account.json.keyfile to the local path of a JSON file; if not, please correct me. I am not sure where to set fs.gs.auth.service.account.json.keyfile. If any documentation is available, please suggest what configuration is needed to work from Windows. – Shawn Mar 06 '17 at 08:34
  • When trying to SSH, I tried gcloud compute ssh as you mentioned, but it gave me a timeout error too. – Shawn Mar 06 '17 at 08:34
  • To my surprise, I am able to create a bucket in Google Cloud Storage using the Storage class. Not sure what is wrong with reading a file from the bucket. – Shawn Mar 06 '17 at 08:34
  • I set up Eclipse on my Windows machine and tried to run the same program, and got the somewhat more meaningful error below: `Exception in thread "main" java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token` – Shawn Mar 06 '17 at 13:35
  • I am getting results in the Google Cloud SDK by running `gsutil ls gs://`. – Shawn Mar 06 '17 at 13:38
  • You should set `fs.gs.auth.service.account.json.keyfile` wherever you're setting `fs.gs.impl` and `fs.AbstractFileSystem.gs.impl`. Usually in a core-site.xml file. – Dennis Huo Mar 06 '17 at 17:52
  • @DennisHuo I am also getting the metadata error: java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token. However, hadoop fs -ls gs://crawl_tld_bucket/ works. What should I do? – Ravi Ranjan Sep 13 '17 at 06:11
  • @Dennis Huo What needs to be done to fix this issue? I am doing the same and having a similar issue now: Exception in thread "main" java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token – vikrant rana May 16 '20 at 14:17
  • If it shows that error, it means the connector didn't successfully receive the config values telling it to look for the JSON keyfile, so it is falling back to assuming it is on a GCE VM; check that your config values are being set correctly and are making it into the connector. When logging is set to DEBUG level, the connector will print all the known config values during initialization. – Dennis Huo May 17 '20 at 16:43

You can read directly from GCS by setting the configuration below on the Spark context (PySpark syntax):

sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.enable","true")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email", "service account email value")
sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "project id value")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.private.key", "entire key value as is")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.private.key.id", "private key id value")