
I'm writing a DataFrame to Parquet, and one or more of its object columns are entirely null.

If I then load the Parquet dataset with use_legacy_dataset=False:

import pyarrow.parquet as pq
parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False, **kwargs)
type(parquet_dataset)
pyarrow.parquet._ParquetDatasetV2

It returns a _ParquetDatasetV2 instance, and when I check the schema, it is a pyarrow.lib.Schema:

type(parquet_dataset.schema) 
pyarrow.lib.Schema

If I load the same file but with use_legacy_dataset=True:

parquet_dataset2 = pq.ParquetDataset(root_path, use_legacy_dataset=True, **kwargs)

The schema for the file is an instance of ParquetSchema:

type(parquet_dataset2.schema)
pyarrow._parquet.ParquetSchema

This is as I would expect, and I'm aware that I can get the Arrow schema like this:

arrow_schema = parquet_dataset2.schema.to_arrow_schema()
type(arrow_schema)
pyarrow.lib.Schema

i.e. the same type as when I use use_legacy_dataset=False.

For an instance of ParquetSchema, I can get the details of any column, e.g.:

parquet_dataset2.schema[13]

<ParquetColumnSchema>
  name: col13
  path: col13
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT96
  logical_type: None
  converted_type (legacy): NONE

Here the "physical_type" for this column is INT96.

parquet_dataset2.schema[13].physical_type
'INT32'

For an instance of pyarrow.lib.Schema, if I get the data type for the same column, I only get null:

parquet_dataset.schema.field("col13").type
DataType(null)

i.e. there is no information about what the data type is supposed to be.

This information is available in the Parquet file. But how do I access it?

Is there a way to convert an instance of pyarrow.lib.Schema to pyarrow._parquet.ParquetSchema?

mishbah
  • Hmm, that sounds like a bug to me. I'm not quite reproducing the issue (I don't have your parquet files) with 6.0.1. Can you try this gist and see what kind of results you get: https://gist.github.com/westonpace/fedc9771eee4f57efdf29c9fd3c32eb2 If you can't reproduce it with the gist then can you help me understand what I'm missing? – Pace Dec 15 '21 at 21:01
  • In your gist, you are explicitly passing "schema" as input for "write table". In my code I'm using `pa.Table.from_pandas(df)`, where the column in question is completely empty (i.e. only null values). I attached a screenshot to your gist. Thank you for your help. – mishbah Dec 15 '21 at 21:32
  • Just checked my environment; I'm also using pyarrow==6.0.1 – mishbah Dec 15 '21 at 21:54
