
I am a DevOps web developer with a site running on two EC2 small instances behind a load balancer on AWS.

Recently we saw 3-4 requests per second take down our client's site.

The site was down and would not come back after multiple server reboots, and error-log scans turned up no scripts that might be causing the issue, even though no changes had been pushed recently.

After I turned on load balancer logging, I saw thousands of requests to a single page coming from one IP address.

Since the load balancer passes the original client address to the servers in the X-Forwarded-For header, we were able to identify the offending IP on the server handling the requests and block it with an .htaccess rule.
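For reference, blocking by client IP behind an ELB has to match on the X-Forwarded-For header, because the connection's source address that Apache sees is the load balancer's, not the client's. A minimal sketch of such an .htaccess rule for Apache 2.4 (the IP is a placeholder from the documentation range):

```apache
# Flag requests whose X-Forwarded-For matches the offending client IP
# (203.0.113.50 is a placeholder)
SetEnvIf X-Forwarded-For "203\.0\.113\.50" blocked
<RequireAll>
    Require all granted
    Require not env blocked
</RequireAll>
```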

While in communication with the client's IT department, we learned that the IP address responsible for the flood of requests was in fact one of their internal company machines.

The responsible machine was remotely rebooted and all requests stopped. The site came back online.

The official explanation for this was "the computer was freaking out".

Is it possible for a web browser or Windows machine to make 3-4 requests per second to a load-balanced web page and take it down for 5+ hours?

Here is what the requests looked like:

2017-01-14T01:00:46.170447Z west-ssl XX.XXX.XX.XXX:33370 - -1 -1 -1 503 0 0 0 "GET https://www.example.com:443/example/12 HTTP/1.1" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko" ECDHE-RSA-AES128-SHA256 TLSv1.2

(In the ELB access log, the missing backend address and the `-1` processing times indicate that the load balancer could not complete the request to any healthy instance, hence the 503.)
zeros-and-ones
  • Computer software can fail in strange ways; it's written by humans, who are infinitely fallible. You probably need to increase protections (AWS has DDoS protection that runs at the network edge, or it might be practical to do it on your instance) or auto-scale your service to meet demand. – Tim Jan 17 '17 at 19:07
  • I've accidentally set up crons with infinite loops, etc. If your site can be crashed by 3-4 requests per second, especially to a single URL, you need to optimize and cache stuff. – ceejayoz Jan 17 '17 at 19:11
  • `...and blocked the IP using an .htaccess rule.` Why? Use the server config, it's much more efficient. If you have access to the server config you pretty much don't need to use .htaccess. – user9517 Jan 17 '17 at 19:16
  • Convenience; we were rushed. I agree that it could have been placed in the Apache config. I did not realize that is faster than placing the rule in .htaccess. – zeros-and-ones Jan 17 '17 at 19:19
  • @Tim, if we had autoscaling in place and the attack continued over the weekend, would we potentially have tens or hundreds of servers running by now? I worry that autoscaling could end up costing a lot if these attacks become frequent. – zeros-and-ones Jan 17 '17 at 19:37
  • Yes, which is why you set an upper limit on the number of instances. You need to decide what is most important: availability or cost. – Tim Jan 17 '17 at 19:46
  • Ah, thanks, I did not realize you could set a limit; I was avoiding autoscaling because of this. Is that done through Elastic Beanstalk, or can you set it up outside of that service? – zeros-and-ones Jan 17 '17 at 19:48
  • Elastic Beanstalk is a kind of managed deployment service. I haven't used it, so I'm not sure if you can set limits there (probably). If you use Auto Scaling directly you can set whatever limits you like. https://aws.amazon.com/autoscaling/ – Tim Jan 17 '17 at 21:45
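
The cost cap discussed in these comments comes from the group's `--min-size`/`--max-size` bounds: the group never scales past the maximum no matter how much load arrives. An illustrative AWS CLI sketch (the group, template, and subnet names are placeholders, not from this setup):

```shell
# Illustrative only: an Auto Scaling group that can never exceed 6 instances
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-template \
  --min-size 2 \
  --max-size 6 \
  --desired-capacity 2 \
  --vpc-zone-identifier "subnet-aaaa,subnet-bbbb"
```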

1 Answer


Sure it's possible - though it depends on a number of factors:

1) It sounds like the server-side application is having issues with concurrency. It might be worth looking at whether the application servers were the bottleneck, or whether it was upstream, such as the DBs, with the application servers running out of memory because the Apache config wasn't flushing threads fast enough. If it was the application servers, it might be worth doing some tuning: spin up an identical machine outside of the ELB and use JMeter to throw some load at it to figure out the bottlenecks.

If it was the database, you may be able to use memcached/ElastiCache (since it looks like you are retrieving a specific object) to cache the actual queries. That way the DB connections respond quickly, and Apache can respond quickly and kill off threads rather than fill up the application machine's memory pool.
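
The caching described here is the cache-aside pattern: check the cache first and only hit the database on a miss. A minimal Python sketch, with an in-process dict standing in for memcached (the function and key names are illustrative, not from the site's code):

```python
import time

_cache = {}        # stands in for memcached/ElastiCache
TTL_SECONDS = 5    # even a short TTL collapses a flood into ~1 query per window

def fetch_example(object_id, db_query):
    """Return the object for object_id, calling db_query only on a cache miss."""
    key = f"example:{object_id}"
    entry = _cache.get(key)
    if entry is not None:
        value, stored_at = entry
        if time.monotonic() - stored_at < TTL_SECONDS:
            return value                      # cache hit: no DB work at all
    value = db_query(object_id)               # cache miss: one DB round trip
    _cache[key] = (value, time.monotonic())
    return value
```

With a 5-second TTL, thousands of identical requests in that window cost roughly one DB query instead of one per request, so the DB connections (and the Apache threads waiting on them) are freed almost immediately.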

If you are really feeling vulnerable, you could put Varnish in front of the application to cache the requests with a 1-5s TTL and absorb a major request flood. But be careful, as VCL is unforgiving and can lead to major issues and pain (cache poisoning/leakage).
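
A sketch of the short forced TTL in VCL 4.0 might look like this (illustrative only; a production config needs cookie, auth, and cache-control handling on top of it):

```vcl
vcl 4.0;

sub vcl_backend_response {
    # Cache every backend response briefly to absorb request floods
    set beresp.ttl = 2s;
}
```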

2) As for the "subject" machine itself: obviously it could have been compromised, and it should definitely be investigated. I'll let you decide whether the IT guy is being honest; that's outside the realm of Server Fault.

Assuming it was not compromised, it could have been some bad JavaScript code. If you do polling refreshes and a timing parameter was somehow modified, the page could very well start sending many requests per second. Likewise, the JS may have been well behaved but the person may have had 25 tabs open and gone home for the evening; if each tab sends 1 request every 5 seconds, that's 5 req/second.
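
The arithmetic for that last scenario, as a sanity check:

```python
tabs = 25
seconds_per_request = 5                      # each tab polls once every 5 seconds
aggregate_rate = tabs / seconds_per_request  # combined load from one machine
print(aggregate_rate)                        # 5.0 requests per second
```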

Brennen Smith
  • RE: Javascript refreshes, would this crash the browser? – zeros-and-ones Jan 17 '17 at 19:17
  • Not necessarily - it could but it would take quite a volume of requests, or some form of memory leak in the browser, to crash the browser itself. But given that the requests didn't stop until the machine was fully rebooted, it doesn't sound like the browser crashed. – Brennen Smith Jan 17 '17 at 19:36
  • More details: after removing each machine from the load balancer and then adding it back in, I ran top on each machine handling requests. httpd would have 5 processes running until all memory was used up and the health check failed, taking the instance out of the load balancer. So it should not have been a DB issue, right? Also, the DB is running on RDS. – zeros-and-ones Jan 17 '17 at 19:43
  • Definitely could be DB issue. When you make a request to the HTTP endpoint and it requires a DB fetch, Apache holds the thread open until the entire transaction is complete - this includes the DB request. If the DB is unable to keep up, requests will pile up and thus threads will be held open as they have not completed yet. Then you run out of memory and you see the issue you just described above. – Brennen Smith Jan 17 '17 at 19:45
  • PHP is correct. It looks like we have our cache driver set to `file`, so it's caching on the server itself. Should this be optimized? – zeros-and-ones Jan 17 '17 at 19:47
  • Setting up Laravel correctly (I assume that's what you're using) is beyond the scope of a comment thread, but even if it's correctly set up, you may need to optimize the SQL queries themselves, or make sure they're actually hitting the Laravel cache. I'd recommend instrumenting your code with Librato, New Relic, or another method and tracing your DB call performance. – Brennen Smith Jan 17 '17 at 19:49
  • Of course; just curious if you had any comments on the performance of file-based caching vs. database/Redis/memcached. Thanks for the tips; I'm downloading JMeter now. – zeros-and-ones Jan 17 '17 at 21:09
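
The pileup described in these comments follows from simple capacity math: if each request holds an Apache worker for the full DB round trip, the server's ceiling is the worker count divided by DB latency, and any arrival rate above that grows the backlog without bound until memory runs out. A small illustrative calculation (the numbers are hypothetical, not measured from this site):

```python
workers = 10          # hypothetical Apache worker/process limit
db_latency_s = 4.0    # seconds each request holds a worker waiting on the DB
arrival_rate = 3.5    # incoming requests per second

capacity = workers / db_latency_s          # max sustainable req/s
backlog_growth = arrival_rate - capacity   # requests piling up per second

print(capacity, backlog_growth)            # 2.5 1.0
```

At these numbers even 3.5 req/s outruns the server, which is consistent with 3-4 req/s taking the site down when each request is slow.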