
I am trying to merge multiple parquet files into one. Their schemas are identical field-wise, but my ParquetWriter is complaining that they are not. After some investigation I found that the pandas metadata embedded in the schemas differs between the files, and that mismatch is what causes the error.

Is it possible to ignore/merge/delete pandas meta? Do I even need pandas meta?
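For reference, the mismatch can be confirmed by comparing two of the input schemas directly. A rough sketch (passing check_metadata=True so that Schema.equals also compares the key/value metadata, which is where the pandas meta lives):

import pyarrow.parquet as pq

schema_a = pq.read_table(f'{MESS_DIR}/{files[0]}').schema
schema_b = pq.read_table(f'{MESS_DIR}/{files[1]}').schema

# The pandas meta is stored as JSON bytes under the b'pandas' key of the schema metadata.
print((schema_a.metadata or {}).get(b'pandas'))

# False when the fields match but the metadata does not.
print(schema_a.equals(schema_b, check_metadata=True))

The merging code: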

import pyarrow.parquet as pq

writer = None
for file_ in files:
    pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
    # Open the writer with the first file's schema; every subsequent table must match it.
    if writer is None:
        writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
                                  use_deprecated_int96_timestamps=True)
    writer.write_table(table=pq_table)
writer.close()

The exact error:

Traceback (most recent call last):
  File "{PATH_TO}/main.py", line 68, in lambda_handler
    writer.write_table(table=pq_table)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
  • Can you open an issue on the Apache Arrow JIRA instance about this? – Wes McKinney Nov 08 '18 at 19:11
  • Sure. Created at https://issues.apache.org/jira/browse/ARROW-3728 – micah Nov 08 '18 at 19:27
  • @WesMcKinney Do you know of anything I can do in the meantime? No idea how long it will take for AA to solve this. Is it possible to remove the pandas meta, and will that cause any problems? – micah Nov 08 '18 at 19:44
  • It is possible, but perhaps not obvious how to do it. Not very many Arrow people look at StackOverflow, so I would suggest asking about this on the mailing list. – Wes McKinney Nov 08 '18 at 19:58
  • I think I found the solution: `pq_table = pq_table.replace_schema_metadata(None)`. You can't stop pyarrow from loading the metadata, but you can clear it by creating a shallow copy of the table without metadata (see the sketch after these comments). – micah Nov 08 '18 at 21:16
  • Yes, that should work. – Wes McKinney Nov 09 '18 at 20:07
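
Putting the workaround from the comments together, a minimal sketch of the merge loop with the schema metadata stripped before writing. This assumes the field-level schemas really are identical, as stated in the question; replace_schema_metadata(None) returns a shallow copy of the table without the key/value metadata, so the pandas meta no longer has to match:

import pyarrow.parquet as pq

writer = None
for file_ in files:
    table = pq.read_table(f'{MESS_DIR}/{file_}')
    # Drop all schema-level key/value metadata, including the b'pandas' entry.
    table = table.replace_schema_metadata(None)
    if writer is None:
        writer = pq.ParquetWriter(COMPRESSED_FILE, schema=table.schema,
                                  use_deprecated_int96_timestamps=True)
    writer.write_table(table=table)
if writer is not None:
    writer.close()

As for whether the pandas meta is needed at all: it is only consulted when converting the table back to pandas (to restore the index and exact column dtypes), so dropping it should be safe for other Parquet consumers.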

0 Answers