
I need to transform data from JSON to Parquet as part of an ETL pipeline. I'm currently doing it with the from_pandas method of a pyarrow.Table. However, building a dataframe first feels like an unnecessary step, and I'd also like to avoid having pandas as a dependency.
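For reference, the current approach looks roughly like this (a minimal sketch; the records and file name are illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Detour through a pandas DataFrame just to get a pyarrow Table.
df = pd.DataFrame([{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}])
table = pa.Table.from_pandas(df)
pq.write_table(table, 'out.parquet')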

Is there a way to write Parquet files without loading the data into a dataframe first?

Milan Cermak

1 Answer


At the moment, the most convenient way to build Parquet files is through pandas, due to its maturity. Nevertheless, pyarrow also provides facilities to build its tables from normal Python data structures:

import pyarrow as pa

# Build a columnar Arrow array directly from a Python list.
string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])
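The resulting table can then be written out directly with pyarrow.parquet; a minimal sketch (the file name is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])

# Persist the Arrow table as a Parquet file, no pandas involved.
pq.write_table(table, 'output.parquet')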

As Parquet is a columnar data format, you will have to load the data into memory once in order to transform the row-wise representation into a columnar one.
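For the JSON-to-Parquet case from the question, that transposition might look like this (a sketch; the field names and input records are assumptions):

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Row-wise JSON records, as they might arrive in the pipeline.
records = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')

# Transpose the rows into one Arrow array per column.
ids = pa.array([r['id'] for r in records])
names = pa.array([r['name'] for r in records])

table = pa.Table.from_arrays([ids, names], ['id', 'name'])
pq.write_table(table, 'records.parquet')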

At the moment, you also need to construct the Arrow arrays all at once; you cannot build them up incrementally. In the future, we plan to expose the (incremental) builder classes from C++: https://github.com/apache/arrow/pull/1930

Uwe L. Korn