I am reading a set of arrow files and am writing them to a parquet file:
import pathlib
from pyarrow import parquet as pq
from pyarrow import feather
import pyarrow as pa

base_path = pathlib.Path('../mydata')
fields = [
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
]
schema = pa.schema(fields)
with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        pqwriter.write_table(table)
My problem is that the `code` field in the arrow files is defined with an `int8` index instead of `int32`. The range of `int8`, however, is insufficient. Hence I defined a schema with an `int32` index for the field `code` in the parquet file.
However, writing the arrow table to parquet now complains that the schemas do not match.
How can I change the datatype of the arrow column? I checked the pyarrow API and did not find a way to change the schema. Can this be done without roundtripping to pandas?