
I need to transform data from JSON to Parquet as part of an ETL pipeline. I'm currently doing it with the from_pandas method of a pyarrow.Table. However, building a dataframe first feels like an unnecessary step, and I'd also like to avoid having pandas as a dependency.
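For reference, the current approach looks roughly like this (a minimal sketch; the records and file name are illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Detour through a pandas DataFrame just to get a pyarrow Table.
df = pd.DataFrame([{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}])
table = pa.Table.from_pandas(df)
pq.write_table(table, 'out.parquet')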

Is there a way to write Parquet files without loading the data into a dataframe first?

Milan Cermak

1 Answer


At the moment, the most convenient way to build Parquet files is through pandas, due to its maturity. Nevertheless, pyarrow also provides facilities to build its tables from normal Python data structures:

import pyarrow as pa

# Build a columnar Arrow array directly from a Python list.
string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])
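The resulting table can then be written out directly with pyarrow.parquet; a minimal sketch (the file name is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])

# Persist the Arrow table as a Parquet file, no pandas involved.
pq.write_table(table, 'output.parquet')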

As Parquet is a columnar data format, you will have to load the data into memory once in order to transform the row-wise representation into a columnar one.
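For the JSON-to-Parquet case from the question, that transposition might look like this (a sketch; the field names and input records are assumptions):

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Row-wise JSON records, as they might arrive in the pipeline.
records = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')

# Transpose the rows into one Arrow array per column.
ids = pa.array([r['id'] for r in records])
names = pa.array([r['name'] for r in records])

table = pa.Table.from_arrays([ids, names], ['id', 'name'])
pq.write_table(table, 'records.parquet')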

At the moment, you also need to construct the Arrow arrays all at once; you cannot build them up incrementally. In the future, we plan to expose the (incremental) builder classes from C++: https://github.com/apache/arrow/pull/1930

Uwe L. Korn