
When I try to open a large compressed file from S3, I get a MemoryError.

import dask.dataframe as dd

df = dd.read_csv('s3://xxxx/test_0001_part_03.gz',
                 storage_options={'anon': True},
                 compression='gzip',
                 error_bad_lines=False)

df.head()
exception: MemoryError

How do I open large compressed files directly from S3?

shantanuo

2 Answers


Short answer

You can't do this with single large gzipped files, because gzip compression does not allow for random access.

Long answer

Usually with large files Dask will pull out blocks of data of a fixed size, like 128MB, and process them independently. However, some compression formats like GZip don't allow easy chunked access like this. You can still use gzipped data with Dask if you have many small files, but each file will be treated as a single chunk. If those files are large then you'll run into memory errors, as you have experienced.
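
For example, here is a minimal sketch of what a blocked read looks like for an uncompressed file (the path below stands in for a hypothetical uncompressed copy of your data, and the blocksize is only illustrative):

import dask.dataframe as dd

# For uncompressed text files Dask can split on byte ranges,
# so each 128MB block becomes its own partition.
df = dd.read_csv('s3://xxxx/test_0001_part_03',  # hypothetical uncompressed copy
                 blocksize=128 * 2**20,
                 storage_options={'anon': True})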

You can use dask.bag, which is usually pretty good about streaming through results. You won't get the Pandas semantics though and you won't get any parallelism within a single file.
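
For instance, a minimal sketch of reading the same gzipped object with dask.bag (assuming it is line-delimited text; what you do with each line afterwards is up to you):

import dask.bag as db

# The whole gzipped object becomes one partition of text lines;
# gzip forces a single chunk, but the lines are processed lazily.
lines = db.read_text('s3://xxxx/test_0001_part_03.gz',
                     compression='gzip',
                     storage_options={'anon': True})

lines.count().compute()  # e.g. count lines, or map a parser over them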

MRocklin

A couple of simple solutions that have probably already occurred to you:

  • store the file on S3 uncompressed, at the cost of a potentially much larger file size and correspondingly slower transfer
  • download and decompress to a local file; of course, you need to have sufficient local storage.

The latter could be achieved as follows:

import s3fs, gzip

s3 = s3fs.S3FileSystem(anon=True)
with s3.open('s3://xxxx/test_0001_part_03.gz', 'rb') as f1:
    with open('local_file', 'wb') as f2:
        # decompress on the fly while streaming from S3
        f3 = gzip.GzipFile(fileobj=f1, mode='rb')
        out = True
        while out:
            out = f3.read(128 * 2**10)  # 128kB at a time
            f2.write(out)
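
Once the decompressed copy exists locally, Dask can read it in fixed-size blocks as usual (a minimal follow-up sketch; 'local_file' is the path written above and the blocksize is only illustrative):

import dask.dataframe as dd

# The uncompressed local file can now be split into independent partitions.
df = dd.read_csv('local_file', blocksize=128 * 2**20)
df.head()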
mdurant
  • NB: we do not explicitly have a multi-threaded file downloader for S3, although one could be made like [this](https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/multithread.py#L57), but likely your bandwidth will saturate with one thread anyway. – mdurant Apr 11 '17 at 18:20