I have Parquet data stored on S3 and an Athena table partitioned by id and date. The Parquet files are stored in
s3://bucket_name/table_name/id=x/date=y/
The Parquet files themselves contain the partition columns (id, date), because of which I am not able to read them using AWS Glue.
I would like to read data from only a few partitions, so I am using a partition predicate as follows:
from datetime import date, timedelta

# Select only yesterday's partition
today = date.today()
yesterday = today - timedelta(days=1)
predicate = "date = date '" + str(yesterday) + "'"
df = glueContext.create_dynamic_frame_from_catalog(database_name, table_name, push_down_predicate=predicate)
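For more than one partition, the same kind of predicate can be assembled from a list of values. This is only a rough sketch: `build_partition_predicate` is a hypothetical helper (not part of Glue), and it assumes the date partition is typed as a date and the id partition as a string in the catalog. It only builds the predicate string; it does nothing about the duplicate-column problem.

```python
from datetime import date, timedelta


def build_partition_predicate(days_back, ids=None, today=None):
    """Build a predicate string covering the last `days_back` days
    (not including today) and, optionally, a list of ids.

    Hypothetical helper; assumes `date` is a date-typed partition
    and `id` is a string-typed partition.
    """
    today = today or date.today()
    # Most recent day first: yesterday, the day before, ...
    days = [today - timedelta(days=n) for n in range(1, days_back + 1)]
    date_list = ", ".join("date '" + d.isoformat() + "'" for d in days)
    predicate = "date in (" + date_list + ")"
    if ids:
        id_list = ", ".join("'" + str(i) + "'" for i in ids)
        predicate += " and id in (" + id_list + ")"
    return predicate
```

The resulting string would be passed as push_down_predicate in the same way as the single-day predicate above.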
However, since the files already contain the partition columns, I am getting the below error:
AnalysisException: Found duplicate column(s) in the data schema and the partition schema: id, date
Is there a way I can read data from only a few partitions like this? Can I somehow read the data while ignoring the id and date columns in the files?
Any sort of help is appreciated :)