
After my CPU usage suddenly went over 400% due to bots swamping my site, I created a robots.txt file as follows and placed it in my root, e.g. "www.example.com/":

User-agent: *
Disallow: /

Now Google respects this file and Googlebot no longer appears in my log file. However, Bingbot and Baiduspider still show up in my logs, and plentifully.

As I had this huge increase in CPU usage and bandwidth, and my hosting provider was about to suspend my account, I first deleted all my pages (in case there was a nasty script), uploaded clean pages, blocked all bots via IP address in .htaccess, and then created that robots.txt file.
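For reference, the IP blocking in my .htaccess looks roughly like this (the addresses below are placeholder ranges, not the ones I actually blocked):

order allow,deny
deny from 203.0.113.0/24
deny from 198.51.100.0/24
allow from all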

I searched everywhere to confirm that I took the right steps (I haven't tried the "ReWrite" option in .htaccess yet).
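(By the "ReWrite" option I mean something along these lines in .htaccess - a rough, untested sketch, with the bot names only as examples:)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bingbot|baiduspider) [NC]
RewriteRule .* - [F,L]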

Can anyone confirm that what I have done should do the job? (Since I started this venture, my CPU usage has gone down to 120% within 6 days, but blocking the IP addresses should, at the very least, have brought the CPU usage back down to my usual 5-10%.)

Richard
  • Sadly, robots.txt is a "gentlemen's agreement". If you have access to a firewall, you could block them outright. Other people have the same problem you have: http://www.webmasterworld.com/search_engine_spiders/4348357.htm (IP addresses to ban in this link) – Harald Brinkhof Jul 10 '12 at 23:47
  • Hi Harald, thanks for the link. I blocked them outright via IP address. I guess that's why they are not reading my robots.txt and meta tags (which I changed). CPU usage is down to 51%, so now I'm letting a few IP addresses through so they can read the robots.txt rules and meta tag rules, and will see how it goes. Thanks again, Richard – Richard Jul 15 '12 at 02:49

1 Answer


If these are legitimate spiders from Bingbot and Baiduspider then they should both honour your robots.txt file as given. However, it can take time before they pick it up and start acting on it if your pages have previously been indexed, which is probably the case here.

It doesn't apply in this instance, but it should be noted that Baiduspider's interpretation of the robots.txt standard differs a little from other mainstream bots (e.g. Googlebot) in some respects. For instance, whilst the standard defines the URL path in the Disallow: record simply as a prefix, Baiduspider will only match whole directory/path names. Where Googlebot will match the URL http://example.com/private/ when given the directive Disallow: /priv, Baiduspider will not.
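To illustrate the difference with a hypothetical example:

User-agent: *
Disallow: /priv

Googlebot treats /priv as a simple prefix, so /priv, /private/ and /privacy.html would all be disallowed. Baiduspider only matches whole directory/path names, so /priv (and anything under /priv/) would be disallowed, but /private/ would still be crawled.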

Reference:
http://www.baidu.com/search/robots_english.html

MrWhite
  • Hi, thanks for the info... but the link is now broken. Does anyone know where that got shifted to? – rosuav Aug 17 '15 at 04:20
  • @rosuav I've updated the link (whether this is _exactly_ the same page I'm not sure?). However, the examples are not at all clear - to the point of being contradictory. Under the `Disallow` directive, it states "`Disallow: /help` disallows ... `/helpabc.html`", however, in the table of examples that follow it implies that `Disallow: /tmp` would _not_ disallow `/tmphoho`! They also give the same example twice (`Disallow: /tmp` and URL `/tmp`) and in one it matches and the other it doesn't!? (That really doesn't make sense, so maybe something has been lost in translation!?) – MrWhite Aug 17 '15 at 07:37