
I'm trying to load all incoming parquet files from an S3 bucket and process them with Delta Lake, but I'm getting an exception.

val df = spark.readStream().parquet("s3a://$bucketName/")

df.select("unit") //filter data!
        .writeStream()
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpointFolder)
        .start(bucketProcessed) //output goes in another bucket
        .awaitTermination()

It throws an exception because "unit" is ambiguous.

I've tried debugging it. For some reason, it finds "unit" twice.


What is going on here? Could it be an encoding issue?

Edit: This is how I create the Spark session:

val spark = SparkSession.builder()
    .appName("streaming")
    .master("local")
    .config("spark.hadoop.fs.s3a.endpoint", endpoint)
    .config("spark.hadoop.fs.s3a.access.key", accessKey)
    .config("spark.hadoop.fs.s3a.secret.key", secretKey)
    .config("spark.hadoop.fs.s3a.path.style.access", true)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
    .config("spark.sql.caseSensitive", true)
    .config("spark.sql.streaming.schemaInference", true)
    .config("spark.sql.parquet.mergeSchema", true)
    .orCreate

Edit 2: Output from df.printSchema():

2020-10-21 13:15:33,962 [main] WARN  org.apache.spark.sql.execution.datasources.DataSource -  Found duplicate column(s) in the data schema and the partition schema: `unit`;
root
 |-- unit: string (nullable = true)
 |-- unit: string (nullable = true)
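
The warning says the duplicate comes from the data schema clashing with the partition schema. A small diagnostic sketch (sampleObjectKey is a hypothetical key of one concrete parquet object in the bucket) to check whether the second "unit" is inferred from the directory layout (folders named unit=...) rather than from the files themselves:

import org.apache.spark.sql.SparkSession

// Diagnostic only, not part of the streaming job.
fun compareSchemas(spark: SparkSession, bucketName: String, sampleObjectKey: String) {
    // Batch read of the whole bucket: partition discovery runs over the
    // directory structure, so a unit=... folder would contribute a second "unit".
    spark.read().parquet("s3a://$bucketName/").printSchema()

    // Batch read of one object: should only show the columns stored in the file itself.
    spark.read().parquet("s3a://$bucketName/$sampleObjectKey").printSchema()
}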
  • Can you try running it with `spark.sql.caseSensitive=true`? Note: it's not recommended, only do this for debugging. But it could be the parquet contains both a `UNIT` and a `unit` column (parquets are case-sensitive). Also, please specify the version of Spark you're working with. – RealSkeptic Oct 20 '20 at 10:04
  • I've tried it just now. It made no difference. Both "unit" and "unit" look the same to me. Both lowercase anyway. Spark version: org.apache.spark:spark-sql_2.12:3.0.1 – Tamás Oct 20 '20 at 11:47
  • When did you apply the setting? Of course they are the same in your example, otherwise they would not have both passed the filter. But the setting should be applied when the session is created, before reading the parquet. – RealSkeptic Oct 20 '20 at 11:50
  • I've added the Spark session creation code in an edit. Spark session creation is the first thing that happens, of course (before reading the parquet). After that, I don't change the config. – Tamás Oct 20 '20 at 11:57
  • Well, for some reason you have two columns with the same name. Now that I see your config, it could be due to schema inference or schema merging. It would be best to check the headers of your individual parquets (you can use AWS S3 select to query a single object), or disable those options. It may also be helpful to use `printSchema` to get some insight. – RealSkeptic Oct 20 '20 at 12:43
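
Following the suggestion in the last comment, a minimal sketch (reusing the endpoint and credential values from the question) of a debugging session with schema merging and streaming schema inference turned off; with inference disabled, the streaming read then needs an explicit schema via .schema(...):

val debugSpark = SparkSession.builder()
    .appName("streaming-debug")
    .master("local")
    .config("spark.hadoop.fs.s3a.endpoint", endpoint)
    .config("spark.hadoop.fs.s3a.access.key", accessKey)
    .config("spark.hadoop.fs.s3a.secret.key", secretKey)
    .config("spark.hadoop.fs.s3a.path.style.access", true)
    .config("spark.sql.streaming.schemaInference", false)  // no inferred stream schema
    .config("spark.sql.parquet.mergeSchema", false)        // no schema merging across files
    .orCreate

// With inference off, supply the schema explicitly (explicitSchema is a
// hypothetical StructType matching the files):
// val df = debugSpark.readStream().schema(explicitSchema).parquet("s3a://$bucketName/")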

1 Answer


Reading the same data like this...

val df = spark.readStream().parquet("s3a://$bucketName/*")

...solves the issue, for whatever reason. I would love to know why... :(
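
Since the warning in the question mentions a clash between the data schema and the partition schema, one way to see what the trailing * changes is to compare a plain batch read of both path variants (a sketch, reusing the session and bucket name from the question):

spark.read().parquet("s3a://$bucketName/").printSchema()   // same root path the stream used
spark.read().parquet("s3a://$bucketName/*").printSchema()  // the glob variant from the answer

If the first variant lists "unit" twice and the second does not, the extra column is likely coming from partition discovery over the directory names rather than from the parquet files.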
