I want to write a stream of big data to a Parquet file with Python. My data is huge and I cannot keep it in memory to write it all in one go.
I found two Python libraries (PyArrow and fastparquet) that can read and write Parquet files. Below is my current attempt using PyArrow, but I am happy to try another library if you know of a working solution:
```python
import pandas as pd
import random

import pyarrow as pa
import pyarrow.parquet as pq


def data_generator():
    # This is a simulation for my generator function
    # It is not allowed to change the nature of this function
    options = ['op1', 'op2', 'op3', 'op4']
    while True:
        dd = {'c1': random.randint(1, 10), 'c2': random.choice(options)}
        yield dd


result_file_address = 'example.parquet'
index = 0
try:
    dic_data = next(data_generator())
    df = pd.DataFrame(dic_data, [index])
    table = pa.Table.from_pandas(df)
    with pq.ParquetWriter(result_file_address, table.schema,
                          compression='gzip', use_dictionary=['c1', 'c2']
                          ) as writer:
        writer.write_table(table)
        for dic_data in data_generator():
            index += 1
            df = pd.DataFrame(dic_data, [index])
            table = pa.Table.from_pandas(df)
            writer.write_table(table=table)
except StopIteration:
    pass
finally:
    del data_generator
```
I have the following issues with the above code:
- All the data accumulates in RAM and is only written to disk at the end of the process, which is not practical for me because of my RAM size limit (see the sketch after this list for the batching approach I am considering).
- I can significantly reduce the size of the resulting file with 7-Zip, so it seems the compression is not working.
- I am getting the following warning from using use_dictionary:

  ```
  Traceback (most recent call last):
    File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
  TypeError: expected bytes, str found
  Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_dictionary_props'
  Traceback (most recent call last):
    File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
  TypeError: expected bytes, str found
  ```
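
For reference, the direction I am considering (but have not verified) is to buffer a fixed number of rows from the generator and write each buffered chunk as its own row group, passing use_dictionary as a plain boolean instead of a list of column names. This is only an untested sketch: BATCH_SIZE, N_BATCHES and the explicit schema are my own guesses, and I do not know whether it actually fixes the memory or compression problems described above.

```python
import itertools
import random

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def data_generator():
    # Same simulated generator as in the code above
    options = ['op1', 'op2', 'op3', 'op4']
    while True:
        yield {'c1': random.randint(1, 10), 'c2': random.choice(options)}


BATCH_SIZE = 10_000  # rows per write; a guess, not a tuned value
N_BATCHES = 5        # only so this sketch terminates; my real stream is endless

gen = data_generator()
schema = pa.schema([('c1', pa.int64()), ('c2', pa.string())])

with pq.ParquetWriter('example.parquet', schema,
                      compression='gzip',
                      use_dictionary=True) as writer:
    for _ in range(N_BATCHES):
        # Collect a chunk of rows so that each write_table() call produces
        # one reasonably sized row group instead of a one-row row group.
        rows = list(itertools.islice(gen, BATCH_SIZE))
        if not rows:
            break
        df = pd.DataFrame(rows)
        table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
        writer.write_table(table)
```

Is writing one row group per buffered chunk like this the intended way to stream data with ParquetWriter, or is there a better pattern?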
Many thanks in advance!