I'm trying to read a month's worth of Reddit data from Pushshift. These files are around 30 GB compressed (zst format) and contain newline-delimited JSON. I am trying to convert them to Parquet.
Here is the code:
time ~/prog/duckdb -c "copy (select created_utc, body, author, subreddit, parent_id, id from read_ndjson_auto('RC_2022-01.zst')) to 'RC_2022-01.parquet' (format 'PARQUET', CODEC 'ZSTD')"
The weird thing is that I can process the most recent file just fine, and that file also happens to be the biggest. Perhaps something changed in the zstd parameters they used to compress the older files, which also suggests this isn't a bug in duckdb.
However, I do need to know how to change the window/frame size or increase the memory available to the zstd decoder. I'm not sure how to control that in duckdb.
EDIT: Additional information, since this is really a zstd problem more than a duckdb problem.
When I try to uncompress the zstd file, I get this error:
zstd -d RC_2022-01.zst -o test
RC_2022-01.zst : Decoding error (36) : Frame requires too much memory for decoding
RC_2022-01.zst : Window size larger than maximum : 2147483648 > 134217728
RC_2022-01.zst : Use --long=31 or --memory=2048MB
As far as I can tell, duckdb doesn't let me pass these parameters in.
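The workaround I'm considering (not yet tested on the full file) is to decompress first with the long-window flag that zstd itself suggests, and then point duckdb at the plain NDJSON file. The intermediate file name RC_2022-01.ndjson is just a placeholder, and the uncompressed output will be very large:

# decompress with a 2 GiB window (the flag from the error message); output is uncompressed NDJSON
zstd -d --long=31 RC_2022-01.zst -o RC_2022-01.ndjson

# then run the same duckdb conversion against the uncompressed file
time ~/prog/duckdb -c "copy (select created_utc, body, author, subreddit, parent_id, id from read_ndjson_auto('RC_2022-01.ndjson')) to 'RC_2022-01.parquet' (format 'PARQUET', CODEC 'ZSTD')"

It would obviously be nicer to avoid the huge intermediate file, which is why I'd still like a way to pass the window-size/memory options through duckdb directly.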