While tracking down a PHP notice raised by WordPress in canonical.php, where parse_url() returned an array without a 'path' entry, we found this related request (answered with a 301 redirect) in the access log:

188.165.XXX.XXX - - [29/Jun/2016:07:58:34 +0200] "GET ?subject=Company-Name - Contact via website HTTP/1.1" 301 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" 1603141 430 520

188.165.XXX.XXX - - [29/Jun/2016:07:58:36 +0200] "GET /?subject=Company-Name HTTP/1.1" 200 4908 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" 404908 433 5445

This seems to be a bot that takes an existing mailto: link from the site, tries to access it via http:, and finally ends up on the homepage.

Note the missing leading slash in the first GET request.

Any idea how this can/does happen?

I tried to reproduce such entries with PHP's file_get_contents(), curl, and similar tools, but to no avail: the access log always showed the leading slash.

The website is on shared hosting. phpinfo() reports "Linux vhost01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64", with "CGI/FastCGI" and "Apache 2.0 Handler" listed as SAPI modules. I can't see exactly which Apache 2 version is running.

Edit: All other log entries have a leading slash.
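
One way to check that claim is to filter the access log for request targets that do not start with a slash. This is only a sketch: find_slashless is a hypothetical helper name, the log path in the usage line is a placeholder, and it assumes the combined log format shown above, where the request line is the first double-quoted field. Legitimate targets such as `OPTIONS *` or absolute URLs sent by proxies would also show up.

```shell
# Sketch only: print access-log entries whose request target lacks a leading slash.
# find_slashless is a hypothetical helper name; reads the given files, or stdin if none.
find_slashless() {
  # Split each line on double quotes; field 2 is the request line, e.g.
  #   GET ?subject=Company-Name HTTP/1.1
  # Its second word is the request target.
  awk -F'"' '{ n = split($2, req, " "); if (n >= 2 && req[2] !~ /^\//) print }' "$@"
}

# Usage (placeholder path):
#   find_slashless /path/to/access.log
```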

1 Answer

I'd say this would reproduce the problem:

echo -e "GET ?subject=Company-Name HTTP/1.1\r\nHost: www.example.com\r\n\r\n" | nc <your IP> 80

As to why it's happening, well, your guess (that someone's scraping mailto links) is quite plausible. A lot of stupid people write software.
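
Ordinary HTTP clients such as curl normally send "/" when the URL has an empty path, which would explain why they could not reproduce the log entry; a hand-written request avoids that normalization. A slightly more portable variant of the command above uses printf instead of `echo -e`, since some shells (dash, for example) do not support the `-e` option. This is a sketch, not a definitive tool: build_request is a hypothetical helper name, and the `Connection: close` header is an addition so that the server closes the connection after responding.

```shell
# Sketch only: build a raw HTTP/1.1 request whose target is sent exactly as given.
# build_request is a hypothetical helper name, not part of any tool.
build_request() {
  # $1 = request target (sent verbatim, no leading slash is added)
  # $2 = value for the Host header
  # "Connection: close" is an assumption added so the server ends the connection.
  printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$1" "$2"
}

# Usage (placeholders as in the command above):
#   build_request '?subject=Company-Name' www.example.com | nc <your IP> 80
```

The matching access-log entry should then show the request line without the leading slash, followed by WordPress's 301 redirect.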

womble