- Will, in general, blocking by IP address work? I know it's been a long time since "IP Address == unique device on Internet", but I'm wondering if these sort of probes generally come from the sort of networks where it'd be safe for me to just block them outright
You can quite easily block many of the requests using a simple .htaccess file. There you can block IPs, URLs and plenty of other things. But I'm not sure what the source of your "bad requests" is. What I do know is that you should start by stopping the bad traffic that we know of. You can do this by making your goal a bit bigger: stop denial-of-service attacks while limiting bad requests at the same time. Everything you need is at this very useful resource. However, it doesn't really say which modules to install. I recommend mod_antiloris and mod_evasive, but you can find loads more here.
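As a rough illustration of the .htaccess blocking mentioned above, here is a minimal sketch. It assumes Apache 2.4 with mod_authz_core, mod_authz_host and mod_rewrite available, and that the directory's AllowOverride permits these directives. The IP addresses are placeholder documentation ranges and the URL patterns are made-up examples, not a vetted blocklist.

```apache
# .htaccess sketch (Apache 2.4). Needs AllowOverride AuthConfig FileInfo.

# Block by IP: allow everyone except the listed addresses.
# 203.0.113.45 and 198.51.100.0/24 are placeholder documentation addresses.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 198.51.100.0/24
</RequireAll>

# Block by URL: answer 403 Forbidden for paths the site never serves.
# The patterns below are hypothetical examples.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteRule ^(phpmyadmin|wp-login\.php|xmlrpc\.php) - [F,NC]
</IfModule>
```

If the same .htaccess also routes everything to a front controller such as index.php, the blocking rule has to come before that catch-all rule, otherwise it never gets a chance to match.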
I would personally look at setting up some of those modules before moving on to hard-blocking certain URLs or IPs. However, if you want to start limiting specific patterns, it might be easier to do so using a PHP script, i.e. route all requests to index.php and analyse them there. This would still require a re-route using the .htaccess file. Drupal does something like this:
```apache
# Pass all requests not referring directly to files in the filesystem to
# index.php. Clean URLs are handled in drupal_environment_initialize().
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !=/favicon.ico
RewriteRule ^ index.php [L]
```
By doing this, you can "trap" every incoming URL. Drupal actually has this built in and will tell you that person X was looking for file Y. Drupal also has modules that can block access according to rules you define. Given that this is possible, I'm sure hooking into PHP will give you plenty of options for deciding whether or not to block access from particular IPs.
I think I have proposed a solution, but I do need more info to advise further. If you do the above, you will be able to gather more information and perhaps pinpoint the exact source of your bad request woes. Using these tools you will be able to see patterns and, at the very least, learn better ways to configure rules to block the bad guys.
- If I can't block by IP address, does anyone maintain a list of URLs that bad actors generally probe for so I can block by URL?
There are Apache modules that do this and make use of their own libraries. There are also PHP libraries that do this, and various networks that keep track of "bad guys", whether they are spamming from particular IPs, email addresses, etc. Here's an entire list of people keeping track of servers that get blacklisted for a variety of reasons. Try it out by entering www.google.com.
- Re #2. If I was going to handle this blocking at the server level, which apache module is right for this sort of thing? (MOD_REWRITE, MOD_SECURITY)
MOD_REWRITE would work to get the request to a PHP file, after which you can deal with the problem in PHP. But this does have a bit of overhead. You are better off using MOD_SECURITY and maybe MOD_EVASIVE.
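To give an idea of what the module-based approach can look like, here is a minimal sketch for the main server or virtual-host configuration. It assumes ModSecurity 2.x (loaded as `security2_module`) and a mod_evasive build whose source file is `mod_evasive20.c`; some packages name these differently, so check your own installation. The rule id, URL pattern and thresholds are illustrative values, not tuned recommendations.

```apache
# Server or virtual-host config sketch; all values are illustrative.

<IfModule security2_module>
    SecRuleEngine On
    # Deny requests for paths this site does not serve.
    # Rule id 100001 and the pattern are hypothetical.
    SecRule REQUEST_URI "@rx ^/(phpmyadmin|wp-login\.php)" \
        "id:100001,phase:1,deny,status:403,log,msg:'Probe for unused path'"
</IfModule>

<IfModule mod_evasive20.c>
    # Size of the per-child hash table used to track clients.
    DOSHashTableSize    3097
    # Block a client requesting the same page more than 5 times per 1 s.
    DOSPageCount        5
    DOSPageInterval     1
    # Block a client making more than 50 requests site-wide per 1 s.
    DOSSiteCount        50
    DOSSiteInterval     1
    # Keep answering 403 to a blocked client for 60 seconds.
    DOSBlockingPeriod   60
</IfModule>
```

Both modules reject the request inside Apache, so no PHP process is started for it, which is where the overhead saving comes from.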
- Is there a better way to block these requests other than by IP or URL?
It really depends. You must study the patterns that emerge and identify the cause. I got very frustrated that we kept getting requests for "transparent.png" (or something), which turned out to be a new standard request made by many mobile phones. We thought it was bad, but it was perfectly legitimate. Don't end up making the same mistake.
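One low-risk way to study the patterns before blocking anything is to log suspicious-looking requests to a separate file and review it for a while. The following is a minimal sketch for a virtual-host configuration, assuming mod_setenvif and mod_log_config are loaded and that the `combined` log format nickname is already defined, as it is in most default configurations. The URL pattern, variable name and log path are hypothetical.

```apache
# Virtual-host config sketch: log suspicious-looking requests to their own
# file so they can be reviewed before anything is blocked.
# The pattern, variable name and log path are hypothetical.
SetEnvIfNoCase Request_URI "(phpmyadmin|wp-login\.php|\.env$)" suspect_request
CustomLog "logs/suspect_requests.log" combined env=suspect_request
```

Because this only logs, a legitimate request that happens to match, like the "transparent.png" case above, still gets served while you work out what it is.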
- Also, the system is hosted on EC2 -- does amazon offer any help with this sort of thing?
I don't know. In my own experience, which was mostly with using it to send mail, we got blacklisted quite quickly, even when sending fewer than 2500 emails. But if you are hosting with them and want them to block incoming "bad requests", they should already be doing that to some extent. Only if you have a mass bot army attacking your server every few days should you ask them to intervene. Otherwise, perhaps ask them to help you identify the source, or do your own investigation and decide from there.