I have a daily process where I read in a historical parquet dataset and concatenate it with a new file each day. I'm trying to optimize memory by making better use of Arrow's dictionary arrays, and I want to avoid systematically round-tripping through pandas (and having to define columns by hand) just to get categoricals. I'm wondering how to do this in pyarrow.
I currently do:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv

historical_table = pq.read_table(historical_pq_path)

# Round trip through pandas purely to get dictionary-encoded strings:
# strings_to_categorical=True makes every string column categorical, which
# comes back to Arrow as a dictionary array; split_blocks/self_destruct
# keep peak memory down during the conversion.
new_table = pa.Table.from_pandas(
    csv.read_csv(new_file_path).to_pandas(
        strings_to_categorical=True,
        split_blocks=True,
        self_destruct=True,
    )
)

combined_table = pa.concat_tables([historical_table, new_table])
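
To confirm the round trip actually produced dictionary columns, I just inspect the resulting schema (a quick sanity check; the exact column names and index widths depend on the file):

# Each string column should now report a dictionary type,
# e.g. dictionary<values=string, indices=int8, ordered=0>
print(new_table.schema)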
I process many files and would like to avoid maintaining a schema for each file where I list which columns should be dictionary-encoded and pass that as read options to the CSV reader (a sketch of that bookkeeping follows below). The convenience of going through pandas with no column specification, using strings_to_categorical=True, is really nice. From what I've seen there isn't a way to do something like a strings_to_dict option natively in pyarrow.
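
For reference, this is roughly the per-file bookkeeping I'm trying to avoid (a minimal sketch; the column names here are made up, and every file would need its own mapping):

# Hypothetical per-file listing of which columns to dictionary-encode;
# "exchange" and "symbol" are placeholder column names.
convert_options = csv.ConvertOptions(
    column_types={
        "exchange": pa.dictionary(pa.int32(), pa.string()),
        "symbol": pa.dictionary(pa.int32(), pa.string()),
    }
)
new_table = csv.read_csv(new_file_path, convert_options=convert_options)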
Is there a clean way to do this in just pyarrow?