My Glue Spark job is failing with the error message: AnalysisException: Found duplicate column(s) in the data schema and the partition schema: day, month, year. My actual Parquet data files in S3 include these partition columns as well.
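To illustrate, here is a minimal sketch of how I understand the file-level schema can be checked: reading a single leaf directory directly means Spark infers no partition columns from the directory names, so the printed schema comes only from the files themselves (the leaf path matches the S3 layout shown further down, and glueContext.spark_session is the usual Glue Spark session):

spark = glueContext.spark_session

# Read one leaf directory directly: no partition discovery happens here,
# so the printed columns are exactly what is stored inside the Parquet files.
leaf_df = spark.read.parquet("s3://bucket/f1/f2/tbl/year=2022/month=10/day=1/")
leaf_df.printSchema()  # year, month and day show up alongside the regular data columns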
Code snippet:
dyf = glueContext.create_dynamic_frame.from_catalog(
    database=db, table_name=tbl,
    additional_options={"catalogPartitionPredicate": "year>=2022 and month>=10 and day>=1"},
    transformation_ctx="dyf",
)
dyfdrop = dyf.drop_fields(paths=["year", "month", "day"])
dyfdrop.toDF().printSchema()
dyfdrop.toDF().show()
S3: s3://bucket/f1/f2/tbl/year=2022/month=10/day=1/
Partition column names in glue catalog: "year", "month", "day"
I am reading from the Glue catalog and filtering on the partition columns with the code above.
I have also tried dropping the partition columns (the drop_fields call above), based on this discussion.
Please help me fix this so that I can read these files.
Thanks