I want to create an Apache Spark DataFrame from an S3 resource. I've tried both AWS S3 and IBM Cloud Object Storage; both fail with
org.apache.spark.util.TaskCompletionListenerException: Premature end of Content-Length delimited message body (expected: 2,250,236; received: 16,360)
I'm running pyspark with
./pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.828,org.apache.hadoop:hadoop-aws:2.7.0
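For reference, I believe the equivalent when building the session from a script instead of the shell is to pass the same coordinates via spark.jars.packages (the app name below is just a placeholder):

from pyspark.sql import SparkSession

# Same package coordinates as the --packages flag above
spark = (SparkSession.builder
         .appName("s3a-csv-test")  # placeholder name, not significant
         .config("spark.jars.packages",
                 "com.amazonaws:aws-java-sdk-pom:1.11.828,"
                 "org.apache.hadoop:hadoop-aws:2.7.0")
         .getOrCreate())
sc = spark.sparkContext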
I'm setting the S3A configuration for IBM Cloud Object Storage with
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xx")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-de.cloud-object-storage.appdomain.cloud")
Or AWS with
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", " xx ")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
In both cases, the following code fails with the exception shown above:

df = spark.read.csv("s3a://drill-test/cases.csv")
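For reference, this is how I'd spell the read out with the usual CSV options; header and inferSchema are assumptions about cases.csv, not something from the failing one-liner above:

df = spark.read.csv("s3a://drill-test/cases.csv", header=True, inferSchema=True)
df.show(5)  # triggers the full scan of the file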