
I have been getting a lot of hits in my logs from clients that crawl most of the top-level pages of my site and identify themselves with a Java version as the User-Agent string.

I see different variants of the Java version in the User-Agent, e.g. Java/1.6.0_04, Java/1.4.1_04, Java/1.7.0_25, etc.

Sometimes, but not always, they get a 404 for /contact/, but never for any of the other pages shown below.

The IPs are almost always spam harvesters and bots, according to Project Honeypot:

78.129.252.190 - - [24/Jan/2014:01:28:52 -0800] "GET / HTTP/1.1" 200 6728 "-" "Java/1.6.0_04" 198 7082
78.129.252.190 - - [24/Jan/2014:01:28:55 -0800] "GET /about HTTP/1.1" 301 - "-" "Java/1.6.0_04" 203 352
78.129.252.190 - - [24/Jan/2014:01:28:55 -0800] "GET /about/ HTTP/1.1" 200 29933 "-" "Java/1.6.0_04" 204 30330
78.129.252.190 - - [24/Jan/2014:01:28:56 -0800] "GET /articles-columns HTTP/1.1" 301 - "-" "Java/1.6.0_04" 214 363
78.129.252.190 - - [24/Jan/2014:01:28:57 -0800] "GET /articles-columns/ HTTP/1.1" 200 29973 "-" "Java/1.6.0_04" 215 30370
78.129.252.190 - - [24/Jan/2014:01:28:58 -0800] "GET /contact HTTP/1.1" 301 - "-" "Java/1.6.0_04" 205 354
78.129.252.190 - - [24/Jan/2014:01:28:58 -0800] "GET /contact/ HTTP/1.1" 200 47424 "-" "Java/1.6.0_04" 206 47827

What are they looking for? A vulnerability?

Can I block these visits based on the Java User-Agent? If so, how? Or could I do it with a PHP function?

I know how to block IPs in .htaccess, but blocking by User-Agent would be a more proactive method for me.

Update 2/04/14: I'm not able to block the Java User-Agent with either of these two rules:

RewriteCond %{HTTP_USER_AGENT} Java/1.6.0_04
RewriteRule ^.*$ - [F]

RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^.*$ - [F]
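For what it's worth, the regex patterns in those RewriteCond lines do match this User-Agent string. A quick sanity check outside Apache (sketched in Python, purely illustrative) confirms it, which suggests the problem is more likely rule placement or host configuration than the pattern itself:

```python
import re

# The same patterns used in the RewriteCond lines above.
# RewriteCond does an unanchored match unless the pattern anchors itself.
user_agent = "Java/1.6.0_04"

print(re.search(r"Java/1\.6\.0_04", user_agent) is not None)  # → True (specific version)
print(re.search(r"^Java", user_agent) is not None)            # → True (any Java client)
print(re.search(r"^Java", "Mozilla/5.0") is not None)         # → False (normal browser)
```

If the pattern matches here but the [F] rule never fires, the rewrite rules are probably not being read at all.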

Note: I'm on shared hosting and have limited access to apache configs.
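If mod_rewrite turns out to be unavailable in .htaccess on this host, one alternative sketch uses mod_setenvif instead (assuming the host permits it; these are standard Apache 2.2 directives, and the pattern is the same `^Java` match as above):

```apache
# Flag any request whose User-Agent starts with "Java" (case-insensitive)
BrowserMatchNoCase ^Java bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Whether this works on shared hosting still depends on what the host's AllowOverride setting permits.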

markratledge
  • I think you mean User-Agent string, not referrer. And, are you asking for us to code this for you?! – Michael Hampton Jan 27 '14 at 15:54
  • I'm not asking for someone to code it for me; it's a question, like any other. And it's multi-part: what's the vulnerability, if there is one? How would this work? Should I block them with the more labor-intensive .htaccess method? – markratledge Jan 27 '14 at 17:06
  • [`RewriteLog`](http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html#rewritelog) and [`RewriteLogLevel`](http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html#rewriteloglevel). Gather some information about the rewrite processing and share it. Your condition might need to be less specific. Try matching on "^Java.*" for example to block any version. (I doubt the User-Agent is just "Java".) – Aaron Copley Feb 10 '14 at 15:40
  • @AaronCopley, thanks, but I'm restricted from using RewriteLog. But I am trying the match for Java as you suggested. – markratledge Feb 10 '14 at 17:07

2 Answers


User-Agent string matching is not a reliable method, as anyone can change the headers they send.

From my experience, every internet-facing webserver is bound to be crawled and scanned (that's THE point, right? :).

If anything, they're probably just crawling your webserver for indexing of some sort. If you want to frustrate those clients or limit the frequency of their requests, I'd suggest an Apache module such as mod_evasive (formerly known as mod_dosevasive) or mod_qos, which can limit the number of concurrent connections per IP per second, and more.
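As a rough sketch, a minimal mod_evasive configuration (directive names taken from the module's documentation; the threshold values here are illustrative, not recommendations) looks like this:

```apache
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    DOSPageCount        5      # max requests for the same page per interval
    DOSSiteCount        50     # max requests for the whole site per interval
    DOSPageInterval     1      # interval in seconds for the page count
    DOSSiteInterval     1      # interval in seconds for the site count
    DOSBlockingPeriod   60     # seconds an offending IP stays blocked (403)
</IfModule>
```

On shared hosting, of course, whether the module is loaded at all is up to the host.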

Keep in mind that this approach could lead to your webserver blocking legitimate requests, for example many clients coming from behind the same NAT router.

Then, if the bots learn your mod_evasive frequency settings, you'll need to code the 403 Forbidden response yourself in your PHP app, defining a set of rules based on crawling behaviour.
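The gist of such an application-level rule set (shown here as a language-agnostic sketch in Python; the window and threshold values are arbitrary examples) is a sliding-window request counter per IP:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # arbitrary example window
MAX_REQUESTS = 20     # arbitrary example budget per window

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def should_block(ip, now=None):
    """Return True (i.e. respond 403 Forbidden) if `ip` exceeded the budget."""
    now = time.time() if now is None else now
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that fell out of the sliding window.
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```

The same logic ports directly to PHP; the only real design choice is where the per-IP state lives (in-memory store, database, etc.).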

Marcel

Is AllowOverride set to All? If not, Apache may be ignoring your .htaccess rewrite rules entirely.

As a more proper solution, I would recommend using mod_evasive[1] to block excessive scanning by any client. It requires iptables, though.

  1. http://www.zdziarski.com/blog/?page_id=442
antimatter