2

I am trying to use DuckDB to show the user-created schema that I have written into a Parquet file. I can demonstrate in Python (using the code example at Get schema of parquet file in Python) that the schema is as I desire, but cannot seem to find a way in DuckDB to get this information.

Neither of the following queries reports the user-created schema

select * from parquet_schema('FileWithMetadata.parquet') 
select * from parquet_metadata('FileWithMetadata.parquet')

update:

Here is the code snippet that creates the metadata in the Parquet file where a is a Pandas dataframe of daily basin flows for a number of different simulations:

table = pa.Table.from_pandas(a)
my_schema = pa.schema([
    pa.field("Flow", "float", True, metadata={"data":"flow in mm per day"}),
    pa.field("DayIndex", "int64", False, metadata={"data":"index of days"}),
    pa.field("BasinIndex", "string", True, metadata={"data":"flow in mm per day"}),
    pa.field("Simulation", "int64", True, metadata={"data":"simulation number"})
    ],

    metadata={"info":"long format basin flows"})

t2 = table.cast(my_schema)
pq.write_table(t2, 'SHALongWithMetadata1.parquet')

and the code to read it back is:

import pyarrow.parquet as pq
pfile = pq.read_table("C:\Projects\CSDP\PythonCSDP\Parquet\SHALongWithMetadata1.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))

and this yields as output :

Column names: ['Flow', 'DayIndex', 'BasinIndex', 'Simulation']
Schema: Flow: float
  -- field metadata --
  data: 'flow in mm per day'
DayIndex: int64 not null
  -- field metadata --
  data: 'index of days'
BasinIndex: string
  -- field metadata --
  data: 'flow in mm per day'
Simulation: int64
  -- field metadata --
  data: 'simulation number'
-- schema metadata --
info: 'long format basin flows'
Robin Moffatt
  • 30,382
  • 3
  • 65
  • 92
rbmales
  • 143
  • 1
  • 8

1 Answers1

2

Use the DESCRIBE keyword:

DESCRIBE
SELECT *
FROM 'FileWithMetadata.parquet';

DESCRIBE TABLE
'FileWithMetadata.parquet';

Table function parquet_metadata repeats the same information for each row group in the Parquet file, so use SELECT DISTINCT to report only one set of unique column names and types:

SELECT DISTINCT
  path_in_schema,
  type 
FROM parquet_metadata('FileWithMetadata.parquet');
Derek Mahar
  • 27,608
  • 43
  • 124
  • 174
  • Sorry does not work, I had tried that previously, just yields column name, type, null, key, and default fields: Flow FLOAT YES DayIndex BIGINT YES BasinIndex VARCHAR YES Simulation BIGINT YES – rbmales Apr 21 '23 at 12:25
  • What kind of schema do you want? DuckDB reports the SQL schema. – Derek Mahar Apr 21 '23 at 12:28
  • I have edited my original post to show the input user-defined metadata creation in Python, the script to read it back, and the results. This is to allow the Parquet file to be self-describing, similar to netcdf and HDF5 files. – rbmales Apr 21 '23 at 13:02