When writing a DataFrame that contains an array (list) column to S3 in Parquet format with awswrangler, the resulting data files are not queryable using S3 Select (with CSV output) or Athena.
For example:
import awswrangler as wr
import pandas as pd

events = [{"c1": "12", "c2": [1, 2, 3, 6], "c3": 1234}]
df = pd.DataFrame.from_dict(events)
wr.s3.to_parquet(
    df=df,
    path=f"{bucket}/{path}",  # s3://... prefix of the dataset
    partition_cols=["c3"],    # partitioning on c3, matching the Athena DDL below
    dataset=True,
)
Attempting to access the objects with the target output format set to CSV gives an error, and no data is retrieved when querying through Athena with the table below (even after the partition metadata is altered). However, if we change the target output format to JSON, or if we drop the array column from the payload before writing, the data is accessible via CSV.
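For reference, a minimal S3 Select sketch of what is being attempted (the object key is hypothetical; substitute one of the Parquet files written above):

import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="test_bucket",
    Key="tmp/wrangler_parquet/c3=1234/example.parquet",  # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},  # errors; {"JSON": {}} returns records
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())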
CREATE EXTERNAL TABLE default.test_wrangler_parquet (c1 string, c2 array<int>)
PARTITIONED BY (c3 string)
STORED AS PARQUET
LOCATION 's3://test_bucket/tmp/wrangler_parquet/';

ALTER TABLE default.test_wrangler_parquet ADD PARTITION (c3='1234');
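For comparison, a minimal sketch (assuming a Glue database named default) that lets awswrangler register the table and partition itself instead of running the DDL by hand, and then queries it through Athena:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame([{"c1": "12", "c2": [1, 2, 3, 6], "c3": 1234}])

# Registers default.test_wrangler_parquet and the c3 partition in the Glue catalog on write.
wr.s3.to_parquet(
    df=df,
    path="s3://test_bucket/tmp/wrangler_parquet/",
    dataset=True,
    partition_cols=["c3"],
    database="default",
    table="test_wrangler_parquet",
)

# Query the dataset through Athena.
print(wr.athena.read_sql_query("SELECT * FROM test_wrangler_parquet", database="default"))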
{"c1": "12", "c2": [{ "item": 1 }, { "item": 2 }, { "item": 3 }, { "item": 6 }], "c3": 1234}