
I'm having an issue storing a large dataset (around 40GB) in a single parquet file.

I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a parquet file until it crashes once the file size in bytes exceeds the int32 maximum of 2147483647 (about 2.1 GB):

Link to minimum reproducible example code
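For readers who don't want to follow the link, the reproducer has roughly this shape (a hedged sketch, not the exact gist code; the make_chunk helper, column names, and file name are made up for illustration, and it assumes fastparquet.write with append=True on a single "simple" file):

import os
import numpy as np
import pandas as pd
from fastparquet import write

PATH = "big_dataset.parquet"   # hypothetical output path

def make_chunk(n_rows=1_000_000):
    # Synthetic data; any sufficiently wide DataFrame grows the file quickly.
    return pd.DataFrame({
        "a": np.random.rand(n_rows),
        "b": np.random.randint(0, 1_000_000, n_rows),
        "c": np.random.rand(n_rows),
    })

while True:
    # append=True is only valid once the file already exists
    write(PATH, make_chunk(), append=os.path.exists(PATH))
    size = os.path.getsize(PATH)
    print(f"file size so far: {size} bytes")
    if size > 2_147_483_647:   # int32 max; the OverflowError appears around here
        break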

Everything goes fine until the file hits about 2.1 GB, at which point I get the following errors:

OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'

Because the exception is ignored internally, it's very hard to figure out which specific thrift object it's upset about or to get a stack trace. However, it's clearly linked to the file size exceeding the int32 range.

Also, these Thrift definitions come from the parquet-format repository itself, so I wonder whether this is a limitation baked into the design of the Parquet format.

Alex Pilafian
  • It can exceed 2.3 GB. How are you appending rows? It's best if you share the code snippet. – ns15 Nov 25 '22 at 05:10
  • @shetty15 I updated my question to contain the explicit code snippet that illustrates exactly how I'm writing to the parquet file. – Alex Pilafian Nov 25 '22 at 13:47
  • @shetty15 Today I've updated the question to link to a gist with minimal example code that reproduces the issue. The code snippet is dead simple, and I feel like it should work. Yet it crashes right when the file size exceeds int32 bounds... – Alex Pilafian Nov 26 '22 at 14:01

1 Answer


I finally figured out that I was running into a genuine bug in the fastparquet Python library; reporting it resulted in a fix to the library itself.

Here is a link to the relevant issue on GitHub.

The commit in which the issue is fixed is 89d16a2.
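For anyone hitting the same error, a quick sanity check is to confirm that the installed fastparquet release is new enough to include that commit (a sketch only; I'm not asserting which release number shipped the fix):

# Check the installed fastparquet version; upgrade if it predates the fix,
# e.g. with: pip install --upgrade fastparquet
import fastparquet
print(fastparquet.__version__)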

Alex Pilafian