Unable to load parquet files with the same column names but in a different order.
Scenario:
ABD-MacBook-Pro:ttt abd$ tree
.
├── testing1.paquet
└── testing2.paquet
I have the two parquet files shown above. Both files have the same column names; only the column order differs. Spark loads these files without any problem. Am I missing something here, or is this not supported by pyarrow?
I'm trying to load those parquet files using the command below.
pandas_df = pq.ParquetDataset('ttt', filesystem=file_system).read_pandas().to_pandas()
Running the above command raises the following error:
ValueError: Schema in ttt//testing2.paquet was different.
C1: string
C2: string
C3: string
C4: string
Unnamed: 4: double
Unnamed: 5: double
Unnamed: 6: double
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "C1", "field_name": "C1", "pandas_type": "unicode", "'
b'numpy_type": "object", "metadata": null}, {"name": "C2", "field_'
b'name": "C2", "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": null}, {"name": "C3", "field_name": "C3", "pandas_typ'
b'e": "unicode", "numpy_type": "object", "metadata": null}, {"name'
b'": "C4", "field_name": "C4", "pandas_type": "unicode", "numpy_ty'
b'pe": "object", "metadata": null}, {"name": "Unnamed: 4", "field_'
b'name": "Unnamed: 4", "pandas_type": "float64", "numpy_type": "fl'
b'oat64", "metadata": null}, {"name": "Unnamed: 5", "field_name": '
b'"Unnamed: 5", "pandas_type": "float64", "numpy_type": "float64",'
b' "metadata": null}, {"name": "Unnamed: 6", "field_name": "Unname'
b'd: 6", "pandas_type": "float64", "numpy_type": "float64", "metad'
b'ata": null}, {"name": null, "field_name": "__index_level_0__", "'
b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
b', "pandas_version": "0.23.0"}'}
vs
C1: string
C2: string
C4: string
C3: string
Unnamed: 4: double
Unnamed: 5: double
Unnamed: 6: double
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "C1", "field_name": "C1", "pandas_type": "unicode", "'
b'numpy_type": "object", "metadata": null}, {"name": "C2", "field_'
b'name": "C2", "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": null}, {"name": "C4", "field_name": "C4", "pandas_typ'
b'e": "unicode", "numpy_type": "object", "metadata": null}, {"name'
b'": "C3", "field_name": "C3", "pandas_type": "unicode", "numpy_ty'
b'pe": "object", "metadata": null}, {"name": "Unnamed: 4", "field_'
b'name": "Unnamed: 4", "pandas_type": "float64", "numpy_type": "fl'
b'oat64", "metadata": null}, {"name": "Unnamed: 5", "field_name": '
b'"Unnamed: 5", "pandas_type": "float64", "numpy_type": "float64",'
b' "metadata": null}, {"name": "Unnamed: 6", "field_name": "Unname'
b'd: 6", "pandas_type": "float64", "numpy_type": "float64", "metad'
b'ata": null}, {"name": null, "field_name": "__index_level_0__", "'
b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
b', "pandas_version": "0.23.0"}'}