I have a daily process where I read in a historical parquet dataset and then concatenate it with a new file each day. I'm trying to optimize memory by making better use of Arrow's dictionary arrays, and I want to avoid systematically round-tripping through pandas (and without having to define the columns) just to get categoricals.

I'm wondering how to do this in pyarrow.

I currently do:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv

historical_table = pq.read_table(historical_pq_path)

# Round-trip the new CSV through pandas so that string columns come
# back as categoricals (dictionary arrays on the Arrow side)
new_table = pa.Table.from_pandas(
    csv.read_csv(new_file_path).to_pandas(
        strings_to_categorical=True,
        split_blocks=True,
        self_destruct=True,
    )
)

combined_table = pa.concat_tables([historical_table, new_table])

I process many files and would like to avoid having to maintain a schema for each file that lists which columns should be dictionary-encoded, and then passing that as read options to csv.read_csv. The convenience of going through pandas with no column specification, using strings_to_categorical=True, is really nice. From what I've seen there isn't a native pyarrow equivalent of strings_to_categorical (something like strings_to_dict).
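
For reference, the per-file schema approach I'm trying to avoid looks roughly like this (the column names and types below are made up for illustration):

import pyarrow as pa
import pyarrow.csv as csv

# Every dictionary column has to be spelled out by hand, per file
convert_options = csv.ConvertOptions(
    column_types={
        "ticker": pa.dictionary(pa.int32(), pa.string()),    # placeholder column
        "exchange": pa.dictionary(pa.int32(), pa.string()),  # placeholder column
        "price": pa.float64(),                                # placeholder column
    }
)
new_table = csv.read_csv(new_file_path, convert_options=convert_options)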

Is there a clean way to do this in just pyarrow?

  • Do you mean you want the `Table` you get from `csv.read_csv` to use dictionary encoding? Have you tried this: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions.auto_dict_encode ? Another option is to provide a schema with `ConvertOptions.column_types`. – 0x26res May 21 '20 at 09:37
  • I must have missed the auto_dict_encode parameter! Thanks for pointing it out. – matthewmturner May 21 '20 at 13:08
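
Putting the suggestion from the comment together with the original snippet, a pyarrow-only version would look roughly like this (a sketch based on ConvertOptions.auto_dict_encode, not verified against the original data):

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Ask the CSV reader to dictionary-encode string columns automatically,
# so no per-file schema and no pandas round trip are needed
convert_options = csv.ConvertOptions(auto_dict_encode=True)

historical_table = pq.read_table(historical_pq_path)
new_table = csv.read_csv(new_file_path, convert_options=convert_options)
combined_table = pa.concat_tables([historical_table, new_table])

ConvertOptions also exposes auto_dict_max_cardinality, which controls when a column falls back to plain strings; that may matter for high-cardinality columns.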

0 Answers