
I'm trying to upload a file to S3 using Hadoop:

hadoop fs -Dfs.s3a.connection.ssl.enabled=false -Dfs.s3a.proxy.host=127.0.0.1 -Dfs.s3a.proxy.port=8123 -put pig_1421167148680.log s3a://access:secret@bucket/temp/trash

But I can't force Hadoop to use the proxy:

16/01/08 11:57:27 INFO http.AmazonHttpClient: Unable to execute HTTP
request: Connect to bucket.s3.amazonaws.com:80 timed out
com.cloudera.org.apache.http.conn.ConnectTimeoutException: Connect to

The proxy itself works fine; I can access the S3 bucket through it with the AWS CLI.
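
For reference, I checked the proxy with something along these lines, using the standard proxy environment variables that the AWS CLI honours (the bucket name is a placeholder):

export http_proxy=http://127.0.0.1:8123
export https_proxy=http://127.0.0.1:8123
aws s3 ls s3://bucket/temp/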

smaj

1 Answer


For this, you need to use distcp rather than the hadoop fs command: hadoop fs operates against your local HDFS cluster, while distcp is the tool for copying between clusters (and S3 is seen as a cluster). An example invocation follows the configuration below.

For this to work, I put all the properties in hdfs-site.xml on each node (because distcp runs distributed across all nodes) rather than on the command line.

So add the following properties to the hdfs-site.xml file on each node:

<property>
  <name>fs.s3a.access.key</name>
  <value>your_access_key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>your_secret_key</value>
</property>
<property>
  <name>fs.s3a.proxy.host</name>
  <value>your_proxy_host</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>your_proxy_port</value>
</property>
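
With those properties in place, a distcp run along these lines should pick up the proxy and credentials from hdfs-site.xml; there is no need to embed them in the URI or pass -D flags. The HDFS source path and bucket are placeholders, and this assumes the file has already been copied into HDFS:

hadoop distcp hdfs:///user/hadoop/pig_1421167148680.log s3a://bucket/temp/trash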
loicmathieu