My Glue Spark job is failing with the error message: AnalysisException: Found duplicate column(s) in the data schema and the partition schema: day, month, year. My actual Parquet data files in S3 include these partition columns as well.
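To illustrate, here is a minimal sketch of how I understand the file-level schema can be checked: reading a single leaf directory directly means Spark infers no partition columns from the directory names, so the printed schema comes only from the files themselves (the leaf path matches the S3 layout shown further down, and glueContext.spark_session is the usual Glue Spark session):

spark = glueContext.spark_session

# Read one leaf directory directly: no partition discovery happens here,
# so the printed columns are exactly what is stored inside the Parquet files.
leaf_df = spark.read.parquet("s3://bucket/f1/f2/tbl/year=2022/month=10/day=1/")
leaf_df.printSchema()  # year, month and day show up alongside the regular data columns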
Code snippet:
dyf = glueContext.create_dynamic_frame.from_catalog(
    database=db, table_name=tbl,
    additional_options={"catalogPartitionPredicate": "year>=2022 and month>=10 and day>=1"},
    transformation_ctx="dyf",
)
dyfdrop = dyf.drop_fields(paths=["year", "month", "day"])
dyfdrop.toDF().printSchema()
dyfdrop.toDF().show()
S3: s3://bucket/f1/f2/tbl/year=2022/month=10/day=1/
Partition column names in glue catalog: "year", "month", "day"
I am reading from the Glue catalog and filtering on the partition columns with the code above.
I have also tried dropping the partition columns (the drop_fields call above), based on this discussion.
Please help me fix this so that I can read these files.
Thanks