I am trying to use DuckDB to show the user-created schema that I have written into a Parquet file. I can demonstrate in Python (using the code example at Get schema of parquet file in Python) that the schema is as I desire, but cannot seem to find a way in DuckDB to get this information.
Neither of the following queries reports the user-created schema
select * from parquet_schema('FileWithMetadata.parquet')
select * from parquet_metadata('FileWithMetadata.parquet')
update:
Here is the code snippet that creates the metadata in the Parquet file where a is a Pandas dataframe of daily basin flows for a number of different simulations:
table = pa.Table.from_pandas(a)
my_schema = pa.schema([
pa.field("Flow", "float", True, metadata={"data":"flow in mm per day"}),
pa.field("DayIndex", "int64", False, metadata={"data":"index of days"}),
pa.field("BasinIndex", "string", True, metadata={"data":"flow in mm per day"}),
pa.field("Simulation", "int64", True, metadata={"data":"simulation number"})
],
metadata={"info":"long format basin flows"})
t2 = table.cast(my_schema)
pq.write_table(t2, 'SHALongWithMetadata1.parquet')
and the code to read it back is:
import pyarrow.parquet as pq
pfile = pq.read_table("C:\Projects\CSDP\PythonCSDP\Parquet\SHALongWithMetadata1.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))
and this yields as output :
Column names: ['Flow', 'DayIndex', 'BasinIndex', 'Simulation']
Schema: Flow: float
-- field metadata --
data: 'flow in mm per day'
DayIndex: int64 not null
-- field metadata --
data: 'index of days'
BasinIndex: string
-- field metadata --
data: 'flow in mm per day'
Simulation: int64
-- field metadata --
data: 'simulation number'
-- schema metadata --
info: 'long format basin flows'