I have a daily process where I read in a historical parquet dataset and then concatenate it with a new file each day. I'm trying to optimize memory by making better use of Arrow's dictionary arrays, and I want to avoid systematically round-tripping through pandas (and without having to define the columns) just to get categoricals.

I'm wondering how to do this in pyarrow.

I currently do:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv

historical_table = pq.read_table(historical_pq_path)

# Round-trip the new CSV through pandas so that string columns come
# back as categoricals (dictionary arrays on the Arrow side)
new_table = pa.Table.from_pandas(
    csv.read_csv(new_file_path).to_pandas(
        strings_to_categorical=True,
        split_blocks=True,
        self_destruct=True,
    )
)

combined_table = pa.concat_tables([historical_table, new_table])

I process many files and would like to avoid having to maintain a schema for each file that lists which columns should be dictionary-encoded, and then passing that as read options to csv.read_csv. The convenience of going through pandas with no column specification, using strings_to_categorical=True, is really nice. From what I've seen there isn't a native pyarrow equivalent of strings_to_categorical (something like strings_to_dict).
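
For reference, the per-file schema approach I'm trying to avoid looks roughly like this (the column names and types below are made up for illustration):

import pyarrow as pa
import pyarrow.csv as csv

# Every dictionary column has to be spelled out by hand, per file
convert_options = csv.ConvertOptions(
    column_types={
        "ticker": pa.dictionary(pa.int32(), pa.string()),    # placeholder column
        "exchange": pa.dictionary(pa.int32(), pa.string()),  # placeholder column
        "price": pa.float64(),                                # placeholder column
    }
)
new_table = csv.read_csv(new_file_path, convert_options=convert_options)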

Is there a clean way to do this in just pyarrow?

  • Do you mean you want the `Table` you get from `csv.read_csv` to use dictionary encoding? Have you tried this: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions.auto_dict_encode ? Another option is to provide a schema with `ConvertOptions.column_types`. – 0x26res May 21 '20 at 09:37
  • I must have missed the auto_dict_encode parameter! Thanks for pointing it out. – matthewmturner May 21 '20 at 13:08
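
Putting the suggestion from the comment together with the original snippet, a pyarrow-only version would look roughly like this (a sketch based on ConvertOptions.auto_dict_encode, not verified against the original data):

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Ask the CSV reader to dictionary-encode string columns automatically,
# so no per-file schema and no pandas round trip are needed
convert_options = csv.ConvertOptions(auto_dict_encode=True)

historical_table = pq.read_table(historical_pq_path)
new_table = csv.read_csv(new_file_path, convert_options=convert_options)
combined_table = pa.concat_tables([historical_table, new_table])

ConvertOptions also exposes auto_dict_max_cardinality, which controls when a column falls back to plain strings; that may matter for high-cardinality columns.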

0 Answers