I have a job that fetches data from a database, runs some processing, and uploads the result to an S3 bucket. The processing takes approximately 1 minute, and the result file is approximately 10MB. The EC2 instances that run the job and the S3 bucket are both in us-west-1.
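For context, the upload step looks roughly like this (a simplified sketch using boto3; the bucket and key names are illustrative, not my actual ones):

```python
import boto3

# The bucket is in us-west-1, the same region as the instances.
s3 = boto3.client("s3", region_name="us-west-1")

def upload_result(local_path, result_id):
    # ~10MB result file; upload_file switches to multipart automatically
    # above the default 8MB threshold.
    s3.upload_file(local_path, "my-results-bucket", "%s.dat" % result_id)
```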
I run this job concurrently on multiple EC2 m3.large instances, one job per instance. With up to about 175 instances, each upload takes less than a second. That isn't many simultaneous requests; maybe 5 per second at most. But shortly after I scale up to 200 instances, uploads take 40-60 seconds, and sometimes even longer.
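(The upload times above are wall-clock measurements around just the S3 call, roughly like this, using the `upload_result` sketch above; the arguments are hypothetical:)

```python
import time

# Time only the upload, so the DB fetch and compute don't skew the numbers.
start = time.monotonic()
upload_result("/tmp/result.dat", "example-id-123")
print("upload took %.1f seconds" % (time.monotonic() - start))
```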
This doesn't seem like an unusual amount of data to be sending to S3, and the individual machines seem fine (CPU at 40-50%). What could be causing this? Could I be hitting a network bandwidth limit? If so, how can I tell?
The files are named with a unique ID, so I tried reversing the ID to spread the keys across S3's key space (as described at https://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/). That didn't change the behavior.
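Concretely, the key change was along these lines (a sketch; the extension is illustrative):

```python
def result_key(unique_id):
    # Reverse the ID so the high-entropy characters come first and keys
    # no longer share a long common ascending prefix (per the linked article).
    return "%s.dat" % unique_id[::-1]

# e.g. ID "0000123456" -> key "6543210000.dat"
```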