I'm trying to read a month's worth of Reddit data from Pushshift. These files are around 30 GB compressed (zst format) and contain newline-delimited JSON. I am trying to convert them to Parquet.
Here is the code:
time ~/prog/duckdb -c "copy (select created_utc, body, author, subreddit, parent_id, id from read_ndjson_auto('RC_2022-01.zst')) to 'RC_2022-01.parquet' (format 'PARQUET', CODEC 'ZSTD')"
The weird thing is that I can process the most recent file just fine, and that file also happens to be the biggest. Perhaps something changed in the zstd parameters they used to compress the older files, which also suggests this isn't a bug in duckdb.
However, I do need to know how to change the window/frame size or increase the memory available to the zstd decoder. I'm not sure how to control that in duckdb.
EDIT: Additional information, since this is really a zstd problem more than a duckdb problem.
When I try to uncompress the zstd file, I get this error:
zstd -d RC_2022-01.zst -o test
RC_2022-01.zst : Decoding error (36) : Frame requires too much memory for decoding
RC_2022-01.zst : Window size larger than maximum : 2147483648 > 134217728
RC_2022-01.zst : Use --long=31 or --memory=2048MB
As far as I can tell, duckdb doesn't let me pass these parameters in.
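The workaround I'm considering (not yet tested on the full file) is to decompress first with the long-window flag that zstd itself suggests, and then point duckdb at the plain NDJSON file. The intermediate file name RC_2022-01.ndjson is just a placeholder, and the uncompressed output will be very large:

# decompress with a 2 GiB window (the flag from the error message); output is uncompressed NDJSON
zstd -d --long=31 RC_2022-01.zst -o RC_2022-01.ndjson

# then run the same duckdb conversion against the uncompressed file
time ~/prog/duckdb -c "copy (select created_utc, body, author, subreddit, parent_id, id from read_ndjson_auto('RC_2022-01.ndjson')) to 'RC_2022-01.parquet' (format 'PARQUET', CODEC 'ZSTD')"

It would obviously be nicer to avoid the huge intermediate file, which is why I'd still like a way to pass the window-size/memory options through duckdb directly.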