
I'm having an issue storing a large dataset (around 40GB) in a single parquet file.

I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a parquet file until it crashes once the file size in bytes exceeds the int32 maximum of 2147483647 (about 2.1 GB):

Link to minimum reproducible example code
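For readers who don't want to follow the link, the reproducer has roughly this shape (a hedged sketch, not the exact gist code; the make_chunk helper, column names, and file name are made up for illustration, and it assumes fastparquet.write with append=True on a single "simple" file):

import os
import numpy as np
import pandas as pd
from fastparquet import write

PATH = "big_dataset.parquet"   # hypothetical output path

def make_chunk(n_rows=1_000_000):
    # Synthetic data; any sufficiently wide DataFrame grows the file quickly.
    return pd.DataFrame({
        "a": np.random.rand(n_rows),
        "b": np.random.randint(0, 1_000_000, n_rows),
        "c": np.random.rand(n_rows),
    })

while True:
    # append=True is only valid once the file already exists
    write(PATH, make_chunk(), append=os.path.exists(PATH))
    size = os.path.getsize(PATH)
    print(f"file size so far: {size} bytes")
    if size > 2_147_483_647:   # int32 max; the OverflowError appears around here
        break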

Everything goes fine until the file hits about 2.1 GB, at which point I get the following errors:

OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'

Because the exception is ignored internally, it's very hard to figure out which specific thrift object it's upset about or to get a stack trace. However, it's clearly linked to the file size exceeding the int32 range.

Also, these Thrift definitions come from the parquet-format repository itself, so I wonder whether this is a limitation baked into the design of the Parquet format.

Alex Pilafian
  • It can exceed 2.3 GB. How are you appending rows? It's best if you share the code snippet. – ns15 Nov 25 '22 at 05:10
  • @shetty15 I updated my question to contain the explicit code snippet that illustrates exactly how I'm writing to the parquet file. – Alex Pilafian Nov 25 '22 at 13:47
  • @shetty15 Today I've updated the question to link to a gist with minimal example code that reproduces the issue. The code snippet is dead simple, and I feel like it should work. Yet it crashes right when the file size exceeds int32 bounds... – Alex Pilafian Nov 26 '22 at 14:01

1 Answer


I finally figured out that I was running into a genuine bug in the fastparquet Python library; reporting it resulted in a fix to the library itself.

Here is a link to the relevant issue on GitHub.

The commit in which the issue is fixed is 89d16a2.
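For anyone hitting the same error, a quick sanity check is to confirm that the installed fastparquet release is new enough to include that commit (a sketch only; I'm not asserting which release number shipped the fix):

# Check the installed fastparquet version; upgrade if it predates the fix,
# e.g. with: pip install --upgrade fastparquet
import fastparquet
print(fastparquet.__version__)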

Alex Pilafian