
My Glue Spark job is failing with the error message: AnalysisException: Found duplicate column(s) in the data schema and the partition schema: day, month, year. My actual parquet data files in S3 include these partition columns as well.

Code snippet:

# Read from the Glue catalog, filtering partitions with a predicate
dyf = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    additional_options={"catalogPartitionPredicate": "year>=2022 and month>=10 and day>=1"},
    transformation_ctx="dyf",
)

# Drop the partition columns from the resulting DynamicFrame
dyfdrop = dyf.drop_fields(paths=["year", "month", "day"])
dyfdrop.toDF().printSchema()
dyfdrop.toDF().show()

S3: s3://bucket/f1/f2/tbl/year=2022/month=10/day=1/

Partition column names in the Glue catalog: "year", "month", "day"
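
For reference, here is a minimal check (assuming spark = glueContext.spark_session) that reads one partition's files directly from the path above; the partition columns show up in the file schema itself, which is what collides with the catalog's partition keys:

# Minimal check, assuming spark = glueContext.spark_session: read one
# partition's parquet files directly. year/month/day appearing here means
# they are stored inside the data files, duplicating the partition keys.
df = spark.read.parquet("s3://bucket/f1/f2/tbl/year=2022/month=10/day=1/")
df.printSchema()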

I am reading from the Glue catalog and filtering on the partition columns using the code above.

I have also tried dropping the partition columns (the drop_fields call above) based on this discussion.
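
I could instead read the files straight from S3 with from_options, which would bypass the catalog's partition schema, but then I would lose the catalogPartitionPredicate pushdown. A sketch of that alternative (the path comes from the layout above):

# Sketch of a direct S3 read that bypasses the catalog's partition schema,
# at the cost of losing the catalogPartitionPredicate pushdown.
dyf_s3 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/f1/f2/tbl/"], "recurse": True},
    format="parquet",
    transformation_ctx="dyf_s3",
)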

Please help me fix this so that I can read these files.

Thanks

Raaj