
I have an S3 bucket with a lot of small files, over 100K, that add up to about 700GB. When loading the objects into a Dask bag and then persisting it, the client always runs out of memory, consuming gigabytes very quickly.

Limiting the scope to a few hundred objects allows the job to run, but the client still uses a lot of memory.

Shouldn't only futures be tracked by the client? How much memory do they take?
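For context, the job looks roughly like this (the bucket name, prefix, and glob pattern below are placeholders, not the real ones):

    import dask.bag as db
    from dask.distributed import Client

    client = Client()

    # Read every small object under the prefix into a bag, then persist it.
    # With 100K+ objects, the client process quickly consumes gigabytes of memory.
    bag = db.read_text("s3://my-bucket/data/**")
    bag = client.persist(bag)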

Kevin McGrath

1 Answer


Martin Durant's answer on Gitter:

The client needs to do a glob on the remote file-system, i.e., download the full definition of all the files, in order to be able to make each of the bag partitions. You may want to structure the files into sub-directories, and make separate bags out of each of those.
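A minimal sketch of that suggestion, assuming the objects are already grouped under a handful of prefixes (the prefix and bucket names below are made up):

    import dask.bag as db

    # Build one bag per sub-directory so each glob only has to list that prefix
    prefixes = ["2019-01", "2019-02", "2019-03"]  # hypothetical sub-directories
    bags = [db.read_text(f"s3://my-bucket/{prefix}/*") for prefix in prefixes]

    # Combine them into a single bag for downstream work
    combined = db.concat(bags)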

The original client code was using a glob (`*`, `**`) to load objects from S3.

With this in mind, fetching the full list of objects first with boto and then passing that explicit list to the bag, with no globs, resulted in minimal memory use by the client and a significant speed improvement.
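Roughly what the working version looks like (bucket name and prefix are placeholders):

    import boto3
    import dask.bag as db

    # List every key up front with boto3's paginator instead of globbing through Dask
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    keys = []
    for page in paginator.paginate(Bucket="my-bucket", Prefix="data/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # Hand the explicit list of URLs to the bag reader, so no glob expansion is needed
    urls = [f"s3://my-bucket/{key}" for key in keys]
    bag = db.read_text(urls)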

Kevin McGrath