
I have an S3 bucket with a lot of small files, over 100K, that add up to about 700GB. When loading the objects into a Dask bag and then persisting it, the client always runs out of memory, consuming gigabytes very quickly.

Limiting the scope to a few hundred objects allows the job to run, but the client still uses a lot of memory.

Shouldn't only futures be tracked by the client? How much memory do they take?
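For context, the job looks roughly like this (the bucket name, prefix, and glob pattern below are placeholders, not the real ones):

    import dask.bag as db
    from dask.distributed import Client

    client = Client()

    # Read every small object under the prefix into a bag, then persist it.
    # With 100K+ objects, the client process quickly consumes gigabytes of memory.
    bag = db.read_text("s3://my-bucket/data/**")
    bag = client.persist(bag)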

Kevin McGrath

1 Answer


Martin Durant's answer on Gitter:

The client needs to do a glob on the remote file-system, i.e., download the full definition of all the files, in order to be able to make each of the bag partitions. You may want to structure the files into sub-directories, and make separate bags out of each of those.
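A minimal sketch of that suggestion, assuming the objects are already grouped under a handful of prefixes (the prefix and bucket names below are made up):

    import dask.bag as db

    # Build one bag per sub-directory so each glob only has to list that prefix
    prefixes = ["2019-01", "2019-02", "2019-03"]  # hypothetical sub-directories
    bags = [db.read_text(f"s3://my-bucket/{prefix}/*") for prefix in prefixes]

    # Combine them into a single bag for downstream work
    combined = db.concat(bags)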

The original client code was using a glob (`*`, `**`) to load objects from S3.

With this in mind, fetching the full list of objects first with boto and then passing that explicit list to the bag, with no globs, resulted in minimal memory use by the client and a significant speed improvement.
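Roughly what the working version looks like (bucket name and prefix are placeholders):

    import boto3
    import dask.bag as db

    # List every key up front with boto3's paginator instead of globbing through Dask
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    keys = []
    for page in paginator.paginate(Bucket="my-bucket", Prefix="data/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # Hand the explicit list of URLs to the bag reader, so no glob expansion is needed
    urls = [f"s3://my-bucket/{key}" for key in keys]
    bag = db.read_text(urls)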

Kevin McGrath