4

I receive messages in an async loop, and from each message I parse a row, which is a dictionary. I would like to write these rows to a Parquet file. To do this, I do the following:

import pyarrow as pa
import pyarrow.parquet as pq

fields = [('A', pa.float64()), ('B', pa.float64()), ('C', pa.float64()), ('D', pa.float64())]
schema = pa.schema(fields)
pqwriter = pq.ParquetWriter('sample.parquet', schema=schema, compression='gzip')

# async cycle starts here
async for message in messages:
    # each table holds a single row, so every write_table call emits its own row group
    row = {'A': [message[1]], 'B': [message[2]], 'C': [message[3]], 'D': [message[4]]}
    table = pa.Table.from_pydict(row, schema=schema)
    pqwriter.write_table(table)
# end of async cycle
pqwriter.close()

Everything works fine; however, the resulting Parquet file is about ~5 MB, whereas if I write the same data to a CSV file, it is only about ~200 KB. I have checked that the data types are the same (the CSV columns are floats and the Parquet columns are floats).

Why is my Parquet file so much larger than a CSV with the same data?

  • 2
    Does this answer your question? [How to use Pyarrow to achieve stream writing effect](https://stackoverflow.com/questions/56747062/how-to-use-pyarrow-to-achieve-stream-writing-effect) – Pace Mar 11 '21 at 09:10

2 Answers

3

Parquet is a columnar format that is optimized for writing data in batches. It is not meant to be used to write data row by row: each call to write_table produces at least one row group with its own metadata and column-chunk headers, so with a single row per row group the per-group overhead dwarfs the actual data and compression has almost nothing to work with.

It is not well suited for your use case as written. You may want to collect the incoming rows in a more suitable intermediate form (say Avro, CSV, or simply an in-memory list) and then convert the data to Parquet in batches, as in the sketch below.
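
For instance, a minimal sketch of the "buffer to CSV, then convert in one batch" variant, assuming the same A to D float columns and the async messages source from the question (dump_rows and csv_to_parquet are just illustrative helper names, and the file paths are placeholders):

import csv

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

async def dump_rows(messages, csv_path='rows.csv'):
    # 1) stream the rows into a cheap row-oriented format as they arrive
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['A', 'B', 'C', 'D'])
        writer.writeheader()
        async for message in messages:
            writer.writerow({'A': message[1], 'B': message[2],
                             'C': message[3], 'D': message[4]})

def csv_to_parquet(csv_path='rows.csv', parquet_path='sample.parquet'):
    # 2) convert the whole file to Parquet in a single batch
    table = pacsv.read_csv(csv_path)
    pq.write_table(table, parquet_path, compression='gzip')

This way every column is written as one large, well-compressed chunk instead of thousands of one-row chunks.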

0x26res
2

I have achieved the desired results as follows:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 1_000_000  # number of buffered rows that triggers a write
data = []
fields = ...  # list of (name, type) tuples
schema = pa.schema(fields)

with pq.ParquetWriter('my_parquet', schema=schema) as writer:
    # async cycle starts here
    row = ...  # dict with the structure described by fields
    data.append(row)

    if len(data) > chunksize:
        # flush the buffered rows as one row group
        table = pa.Table.from_pandas(pd.DataFrame(data), schema=schema)
        writer.write_table(table)
        data = []
    # end of async cycle

    # write whatever is still buffered once the loop has finished
    if data:
        table = pa.Table.from_pandas(pd.DataFrame(data), schema=schema)
        writer.write_table(table)
# the with-block closes the writer, so no explicit writer.close() is needed

This code snippet does exactly what I need.
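
If you want to confirm that the batching worked, the Parquet file metadata shows how many row groups were written; a quick check could look like this (a small sketch, assuming 'my_parquet' is the file produced above):

import pyarrow.parquet as pq

pf = pq.ParquetFile('my_parquet')
# a handful of large row groups is what you want, not one row group per row
print(pf.metadata.num_row_groups, pf.metadata.num_rows)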

jtlz2