I have a use case where an S3 bucket contains a few hundred gzipped files. Each individual file, when decompressed and loaded into a dataframe, is larger than the available memory. I'd like to read these files and perform some processing on them, but I can't figure out how to make it work. I've tried Dask's read_table, but compression='gzip' appears to require blocksize=None, which means each file ends up as a single partition that doesn't fit in memory. Is there something I'm missing, or is there a better approach?
Here's how I'm trying to call read_table (note that names, types, and nulls are defined elsewhere):
import csv
import dask.dataframe as dd

df = dd.read_table(uri,
                   compression='gzip',
                   blocksize=None,
                   names=names,
                   dtype=types,
                   na_values=nulls,
                   encoding='utf-8',
                   quoting=csv.QUOTE_NONE,
                   error_bad_lines=False)
I'm open to any suggestions, including changing the entire approach.
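To illustrate what I mean by changing the approach: the only fallback I've come up with so far is to drop Dask and stream each file in chunks with plain pandas, since gzip can at least be decompressed sequentially. This is just a rough, untested sketch (it assumes s3fs is installed so pandas can open s3:// URIs; names, types, and nulls are the same objects defined elsewhere, and process_chunk is a placeholder for whatever per-chunk work I'd actually do):

import csv
import pandas as pd

def process_file(uri):
    # Stream one gzipped file at a time; each chunk stays within memory
    # even though the whole decompressed file would not.
    reader = pd.read_csv(uri,
                         sep='\t',
                         compression='gzip',
                         names=names,
                         dtype=types,
                         na_values=nulls,
                         encoding='utf-8',
                         quoting=csv.QUOTE_NONE,
                         chunksize=100_000)  # tune so a chunk fits in memory
    for chunk in reader:
        process_chunk(chunk)  # placeholder for the per-chunk processing

I don't know if that's the right direction either, so any guidance is appreciated.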