
I have a job that gets data from a database, runs some code, and uploads the result to an S3 bucket. The code takes approximately 1 minute to run, and the result file is approximately 10 MB. Both the EC2 instances and the S3 bucket are in us-west-1.

I run separate instances of this job on multiple EC2 m3.large instances at once. With up to about 175 instances, each upload takes less than a second. That's not many simultaneous requests, maybe up to 5 per second. Shortly after I increase the count to 200 instances, each upload takes 40-60 seconds and sometimes even longer.

It seems like this shouldn't be an unusual amount of data to send to S3, and the individual machines seem to be doing fine (CPU 40-50%). What could be causing this? Could I be hitting a network bandwidth limit? If so, how can I tell?

The files were named with a unique id, so I tried reversing the id to spread out the keys (as described at https://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/). That didn't change the behavior.
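For reference, the key change I tried looks roughly like the sketch below. The function name, the example id, and the `.dat` suffix are hypothetical stand-ins, not my real naming scheme. The idea is only that sequential ids share a long common prefix, and reversing them moves the fast-changing digits to the front of the key.

```python
def reversed_key(unique_id: str, suffix: str = ".dat") -> str:
    """Build an S3 object key from a unique id with its characters reversed.

    Hypothetical helper: the suffix and the id format are assumptions
    made for illustration only.
    """
    return unique_id[::-1] + suffix


# Sequential ids that would otherwise share the prefix "100012"
for job_id in ("1000123", "1000124", "1000125"):
    print(job_id, "->", reversed_key(job_id))
```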

kielni
  • Are you in a VPC private subnet going through a NAT instance? – ceejayoz May 07 '15 at 14:43
  • yes, what should that tell me? – kielni May 07 '15 at 15:10
  • Splitting the machines into multiple subnets was too complicated, and I can't make them public. For now I am sending the data to an FTP server inside the VPC instead of S3. VPC Endpoints give direct access to S3, and seem relatively easy to implement: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-endpoints.html – kielni May 14 '15 at 15:55
  • Yeah, the new VPC endpoints are perfect for this. Must've been a few people hitting your use case for them to release it! – ceejayoz May 14 '15 at 15:56

1 Answer


You're probably hitting a bottleneck on the NAT instance. Driving 200 servers' worth of large HTTP requests through a single NAT instance is probably taxing it too much, whether on CPU or on network bandwidth. Either split your servers across multiple subnets with multiple NAT instances, or put them in a public subnet so they reach S3 directly rather than through a NAT.
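As a rough sanity check, the aggregate traffic through the NAT can be estimated from the figures in the question. This is only a back-of-the-envelope sketch: the function name is made up, it assumes decimal megabytes (1 MB = 8 Mbit), and it ignores HTTP/TLS overhead and the database traffic, which may share the same path. The thread does not say what the NAT instance type can sustain, so the result is something to compare against that instance's measured throughput, not a verdict.

```python
def aggregate_mbit_per_s(uploads_per_second: float, file_size_mb: float) -> float:
    """Estimate total upload traffic in Mbit/s (assumes 1 MB = 8 Mbit)."""
    return uploads_per_second * file_size_mb * 8


FILE_SIZE_MB = 10  # result file size from the question

# Peak rate mentioned in the question: about 5 uploads per second
peak = aggregate_mbit_per_s(5, FILE_SIZE_MB)

# Steady-state guess: 200 instances, each finishing a job about once a minute
steady = aggregate_mbit_per_s(200 / 60, FILE_SIZE_MB)

print(f"peak:   {peak:.0f} Mbit/s")
print(f"steady: {steady:.0f} Mbit/s")
```

If the NAT instance's network graphs in CloudWatch flatten out near figures like these while uploads slow down, that points to bandwidth; if its CPU is pegged instead, that points to CPU.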

ceejayoz