
I'm trying to run a DeltaStreamer job to push data to an S3 bucket using the following command:

spark-submit  \
    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3a.endpoint=s3.ap-south-1.amazonaws.com \
    --conf spark.hadoop.fs.s3a.access.key='AA..AA' \
    --conf spark.hadoop.fs.s3a.secret.key='WQO..IOEI' \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
    --table-type COPY_ON_WRITE \
    --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
    --source-ordering-field cloud.account.id \
    --target-base-path s3a://test \
    --target-table test1_cow \
    --props /var/demo/config/kafka-source.properties \
    --hoodie-conf hoodie.datasource.write.recordkey.field=cloud.account.id \
    --hoodie-conf hoodie.datasource.write.partitionpath.field=cloud.account.id \
    --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

This returns the following error:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 9..1, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: G..g=
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    ...

I think I'm using the correct S3 endpoint. Do I need to create an S3 Access Point? I'm following the versions mentioned in https://hudi.apache.org/docs/docker_demo.html (https://github.com/apache/hudi/tree/master/docker).

ProgramSpree

1 Answer


That AWS region (ap-south-1) is V4-signing only, so you must set the endpoint to the regional one.

But: that version of the hadoop-* JARs and the AWS SDK doesn't handle setting endpoints through the fs.s3a.endpoint option. It is four years old, after all, released before any of the v4-only AWS regions were launched.
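A quick way to confirm which versions those --packages coordinates actually resolved is to list Spark's local Ivy cache (default path shown; adjust if you've configured Ivy differently):

    # list the JARs Spark resolved from --packages
    ls ~/.ivy2/jars | grep -E 'hadoop-aws|aws-java-sdk'

You should see the 2.7.3 hadoop-aws and 1.7.4 SDK artifacts, which is exactly the old combination described above.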

Upgrade the Hadoop version to something released in the last 2-3 years; my recommendation is Hadoop 3.3.1 or 3.2.2.

That is:

  1. All of the hadoop-* JARs, not just individual ones. Trying to upgrade only hadoop-aws.jar will just get you new stack traces.
  2. A matching AWS SDK bundle JAR. The mvnrepository entry for hadoop-aws shows the version you need.

The easiest route is to go to hadoop.apache.org, download an entire release, and then extract the JARs.
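As a rough sketch, assuming the 3.3.1 release (older tarballs live on the Apache archive mirror; swap in whichever version you pick):

    # download and unpack a full Hadoop release
    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
    tar -xzf hadoop-3.3.1.tar.gz
    # hadoop-aws and its matching aws-java-sdk-bundle ship under share/hadoop/tools/lib
    ls hadoop-3.3.1/share/hadoop/tools/lib | grep -E 'hadoop-aws|aws-java-sdk-bundle'

Put that whole consistent set of hadoop-* JARs plus the SDK bundle on the Spark classpath rather than mixing them with the 2.7.3 ones.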

stevel