I want to write a stream of big data to a Parquet file with Python. My data is huge and I cannot keep it in memory to write it all in one go.
I found two Python libraries (PyArrow and fastparquet) that can read and write Parquet files. Below is my current attempt using PyArrow, but I am happy to try another library if you know of a working solution:
```python
import pandas as pd
import random

import pyarrow as pa
import pyarrow.parquet as pq


def data_generator():
    # This is a simulation for my generator function
    # It is not allowed to change the nature of this function
    options = ['op1', 'op2', 'op3', 'op4']
    while True:
        dd = {'c1': random.randint(1, 10), 'c2': random.choice(options)}
        yield dd


result_file_address = 'example.parquet'
index = 0
try:
    dic_data = next(data_generator())
    df = pd.DataFrame(dic_data, [index])
    table = pa.Table.from_pandas(df)
    with pq.ParquetWriter(result_file_address, table.schema,
                          compression='gzip', use_dictionary=['c1', 'c2']
                          ) as writer:
        writer.write_table(table)
        for dic_data in data_generator():
            index += 1
            df = pd.DataFrame(dic_data, [index])
            table = pa.Table.from_pandas(df)
            writer.write_table(table=table)
except StopIteration:
    pass
finally:
    del data_generator
```
I have the following issues with the above code:
- All the data accumulates in RAM and is only written to disk at the end of the process, which is not practical for me because of my RAM size limit (see the sketch after this list for the batching approach I am considering).
- I can significantly reduce the size of the resulting file with 7-Zip, so it seems the compression is not working.
- I am getting the following warning from using use_dictionary:

  ```
  Traceback (most recent call last):
    File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
  TypeError: expected bytes, str found
  Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_dictionary_props'
  Traceback (most recent call last):
    File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
  TypeError: expected bytes, str found
  ```
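
For reference, the direction I am considering (but have not verified) is to buffer a fixed number of rows from the generator and write each buffered chunk as its own row group, passing use_dictionary as a plain boolean instead of a list of column names. This is only an untested sketch: BATCH_SIZE, N_BATCHES and the explicit schema are my own guesses, and I do not know whether it actually fixes the memory or compression problems described above.

```python
import itertools
import random

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def data_generator():
    # Same simulated generator as in the code above
    options = ['op1', 'op2', 'op3', 'op4']
    while True:
        yield {'c1': random.randint(1, 10), 'c2': random.choice(options)}


BATCH_SIZE = 10_000  # rows per write; a guess, not a tuned value
N_BATCHES = 5        # only so this sketch terminates; my real stream is endless

gen = data_generator()
schema = pa.schema([('c1', pa.int64()), ('c2', pa.string())])

with pq.ParquetWriter('example.parquet', schema,
                      compression='gzip',
                      use_dictionary=True) as writer:
    for _ in range(N_BATCHES):
        # Collect a chunk of rows so that each write_table() call produces
        # one reasonably sized row group instead of a one-row row group.
        rows = list(itertools.islice(gen, BATCH_SIZE))
        if not rows:
            break
        df = pd.DataFrame(rows)
        table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
        writer.write_table(table)
```

Is writing one row group per buffered chunk like this the intended way to stream data with ParquetWriter, or is there a better pattern?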
Many thanks in advance!