
I'm using the Cloudera QuickStart VM and started experimenting with Google Cloud Platform yesterday. I'm trying to copy data from Cloudera HDFS to (1) Google Cloud Storage (gs://bucket_name/) and (2) an HDFS cluster on Google Cloud (using hdfs://google_cluster_namenode:8020/).

  1. I set up service account authentication and configured my Cloudera core-site.xml as instructed in this post. With that in place,

    hadoop fs -cp hdfs://quickstart.cloudera:8020/path_to_copy/ gs://bucket_name/

works fine. However, I'm not able to use distcp to copy to Google Cloud Storage; I get the following error. I know it's not a URI issue. Is there anything else I'm missing?

Error: java.io.IOException: File copy failed: hdfs://quickstart.cloudera:8020/path_to_copy/file --> gs://bucket_name/file
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:284)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:252)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) 
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://quickstart.cloudera:8020/path_to_copy/file to gs://bucket_name/file
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:280)
... 10 more 
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: gs://bucket_name.distcp.tmp.attempt_1461777569169_0002_m_000001_2
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:116)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getTmpFile(RetriableFileCopyCommand.java:233)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:107)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
... 11 more
  2. I'm not able to get distcp to connect to the NameNode of the Google Cloud HDFS cluster; I keep getting "Retrying connect to server". I couldn't find any documentation on configuring a connection between the Cloudera HDFS cluster and the Google Cloud HDFS cluster. I assumed that the service account authentication would work with the Google Cloud HDFS cluster too. Is there reference documentation I can use to set up copying between clusters? Is there any other authentication setup I'm missing?
Kia
  • "*I know it's not a URI issue*" And how do you know that? How would the cloudera VM know what to do with `gs://`? That isn't a common URI. You are getting a `URISyntaxException`, so I want to say it is a URI issue – OneCricketeer Apr 27 '16 at 21:43
  • I meant to say that the same URI works when I use `hadoop fs -cp`. But distcp doesn't understand it? – Kia Apr 28 '16 at 10:18
  • The question you linked to references an older version of the connector (1.2.8, while the current version is 1.4.5). Can you verify which version you have installed on your cluster? This page has details for getting the connector: https://cloud.google.com/hadoop/google-cloud-storage-connector#getting Further, it might be worthwhile to attempt to write to a directory on GCS: gs://your_bucket/directory/file (I've verified that vanilla hadoop distcp to the root of the bucket works in the latest version, but not with 1.2.8, and the error message seems to indicate it's mangling the bucket). – Angus Davis Apr 28 '16 at 17:57
  • @Angus Thanks for your response. Yes, it is version 1.2.8. Installing 1.4.5 brings up other classpath issues, which I need to look into. And yes, I am trying to write to a directory on GCS. – Kia Apr 28 '16 at 21:17

1 Answer


It turns out I had to modify the firewall rules to allow TCP/HTTP traffic from the IP address of the machine I was running distcp on. Check the network firewall rules on your GCP Compute Engine instances.
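
A quick way to confirm whether a firewall is what's behind "Retrying connect to server" is to test plain TCP reachability from the machine that runs distcp. The following is only a minimal sketch: `can_connect` is a hypothetical helper name, and the host and port you pass in are whatever your own NameNode uses (the question uses `google_cluster_namenode` and `8020`).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("google_cluster_namenode", 8020)` returning `False` from the Cloudera VM would suggest the port is blocked or the host is unreachable, rather than an authentication problem. Since HDFS clients also talk to the DataNodes directly, it is probably worth running the same check against the DataNode hosts and ports.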

Kia