When writing a DataFrame that contains an array (list) column to S3 in Parquet format with awswrangler, the resulting data files are not queryable using S3 Select (with CSV output) or Athena.
For example:
import awswrangler as wr
import pandas as pd

events = [{"c1": "12", "c2": [1, 2, 3, 6], "c3": 1234}]
df = pd.DataFrame.from_dict(events)
wr.s3.to_parquet(
    df=df,
    path=f"{bucket}/{path}",  # s3://... prefix of the dataset
    partition_cols=["c3"],    # partitioning on c3, matching the Athena DDL below
    dataset=True,
)
Attempting to access the objects with the target output format set to CSV gives an error, and no data is retrieved when querying through Athena with the table below (even after the partition metadata is altered). However, if we change the target output format to JSON, or if we drop the array column from the payload before writing, the data is accessible via CSV.
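For reference, a minimal S3 Select sketch of what is being attempted (the object key is hypothetical; substitute one of the Parquet files written above):

import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="test_bucket",
    Key="tmp/wrangler_parquet/c3=1234/example.parquet",  # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},  # errors; {"JSON": {}} returns records
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())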
CREATE EXTERNAL TABLE default.test_wrangler_parquet (c1 string, c2 array<int>)
PARTITIONED BY (c3 string)
STORED AS PARQUET
LOCATION 's3://test_bucket/tmp/wrangler_parquet/';

ALTER TABLE default.test_wrangler_parquet ADD PARTITION (c3='1234');
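For comparison, a minimal sketch (assuming a Glue database named default) that lets awswrangler register the table and partition itself instead of running the DDL by hand, and then queries it through Athena:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame([{"c1": "12", "c2": [1, 2, 3, 6], "c3": 1234}])

# Registers default.test_wrangler_parquet and the c3 partition in the Glue catalog on write.
wr.s3.to_parquet(
    df=df,
    path="s3://test_bucket/tmp/wrangler_parquet/",
    dataset=True,
    partition_cols=["c3"],
    database="default",
    table="test_wrangler_parquet",
)

# Query the dataset through Athena.
print(wr.athena.read_sql_query("SELECT * FROM test_wrangler_parquet", database="default"))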
{"c1": "12", "c2": [{ "item": 1 }, { "item": 2 }, { "item": 3 }, { "item": 6 }], "c3": 1234}