
I have a problem.

My web crawler runs correctly from home and from the university network, even though the pages I need are under /pgol/ and the site's robots.txt says this:

# File controlled by PUPPET: do not modify!!!
# /robots.txt file for http://www.paginegialle.it

User-Agent: bingbot
Crawl-delay: 30

User-Agent: msnbot
Crawl-delay: 30

User-agent: *
Disallow: /pgol/
Disallow: /pg/cgi/
Disallow: /pgolfe/
Disallow: /info/*.html

User-Agent: bingbot
Crawl-delay: 30

User-Agent: msnbot
Crawl-delay: 30

Sitemap: http://www.paginegialle.it/sitemap.xml
Sitemap: http://www.paginegialle.it/sitemap_fe.xml

but when I run it from work, the site immediately recognizes me as a robot and sends me this page:

<!DOCTYPE html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/pgol/4-Benzinaio/3-Roma/p=1?mr=50&distil_RID=06AFED2E-B651-11E3-8450-306F5DBA1712" />
<script type="text/javascript" src="/ga.137584219024.js?PID=6D4E4D1D-7094-375D-A439-0568A6A70836" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#centersf323034b,#Freddy231a90d5,#category58c315d5,#Freddy231a90d5{display:none!important}</style></head>
<body>
<div id="distil_ident_block">&nbsp;</div>
<div id="d__fFH"><OBJECT id="d_dlg" CLASSID="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></OBJECT><span id="d__fF"></span></div></body>
</html>

I think this was caused by a colleague of mine who made lots of bad requests, so the server registered our IP as a bad robot.

I don't know how the server actually works, so what I just said could be wrong.

I'm using Java, in particular crawler4j from Google Code.

Can you explain the situation and propose any solutions?
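For context, a typical crawler4j setup looks something like this (a simplified sketch based on crawler4j's 3.x API; the storage path and seed URL are placeholders, and the crawler class is not shown):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class PoliteCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data"); // placeholder path
        config.setPolitenessDelay(30000); // wait 30 s between requests, like the Crawl-delay hint

        PageFetcher fetcher = new PageFetcher(config);

        RobotstxtConfig robotsConfig = new RobotstxtConfig();
        robotsConfig.setEnabled(true); // make crawler4j obey robots.txt
        RobotstxtServer robots = new RobotstxtServer(robotsConfig, fetcher);

        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://www.paginegialle.it/");
        // controller.start(MyCrawler.class, 1); // MyCrawler would extend WebCrawler
    }
}
```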

Baldo
  • But your web crawler is a robot, the site doesn't want to be crawled by robots. Perhaps you need to change where you work. – Jodrell Mar 28 '14 at 09:05
  • I'm hacking, not cracking. I work for a software house that makes web sites, and I don't want to steal information: I need to test the limits of security and learn how to improve it. So... can you explain what is happening? Why does it work everywhere else? – Baldo Mar 28 '14 at 09:17
  • Did you try to use a proxy? – Oleg Mar 28 '14 at 09:20
  • No. Tell me more about it, @Oleg. As a reminder, I'm using Java, in particular crawler4j from Google Code. – Baldo Mar 28 '14 at 09:22
  • Try to Google the following: https://www.google.com/#q=how+to+change+external+ip https://www.google.com/#q=how+to+set+up+proxy – Oleg Mar 28 '14 at 09:39
  • @Baldo, because you run bots from the IP address at your workplace, it's been blacklisted as an address from which bots are run, probably because a page was crawled that was disallowed in a site's `robots.txt`. If you use a proxy, your requests will go via another, probably un-blacklisted IP address. To prevent that other IP address being blacklisted too, you could either not use it for robots, or make your robot appear more like a user of a normal web browser. – Jodrell Mar 28 '14 at 10:32
  • @Baldo Either obey the robots.txt or use a `User-Agent` header like a browser's, use other headers like a browser, only navigate to resources that are visible in a browser, and do it all with a throttle; essentially, behave like a real user. – Jodrell Mar 28 '14 at 10:33
  • @Baldo Most of all, don't misuse any information you acquire. Otherwise, your workplace could suffer from more than a blacklisted IP address. – Jodrell Mar 28 '14 at 10:39
  • Thank you very much @Jodrell and Oleg. I'm trying what you just said. Don't worry about the data crawled, we don't need them and we have good intentions ;) PS: Sorry but I'm new here: how can I rate your comments as useful? – Baldo Mar 28 '14 at 11:00
  • Hover the mouse cursor over a comment, and click arrow up at the left of the comment. – Oleg Mar 28 '14 at 11:43
  • (The problem is that I don't have the arrows :/ ) Anyway, I have another question about this: using proxies, I noticed that the server recognizes me as a robot after only 4 pages, and it's not a matter of timing because I request a page only every 10 seconds. I also see that if I open the same page from my browser with JavaScript disabled, the server thinks I'm a robot even in Chrome, and after a few seconds it sends me the captcha! So, is there a way to make a request from Java that executes JavaScript like a browser? @Oleg and Jodrell – Baldo Mar 28 '14 at 12:18
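Following Jodrell's suggestion, sending browser-like headers can be sketched in plain Java (the User-Agent and header values below are just example strings; note this does not execute JavaScript, which, as the last comment observes, this site apparently also checks for):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeRequest {

    // Build a connection whose request headers resemble a normal browser's.
    // The connection is only configured here, not yet opened to the network.
    static HttpURLConnection open(String page) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
        conn.setRequestProperty("User-Agent",
            "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0");
        conn.setRequestProperty("Accept",
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "it-IT,it;q=0.8,en;q=0.5");
        return conn; // caller still has to connect, read, and throttle requests
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = open("http://www.paginegialle.it/");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```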

1 Answer


robots.txt is like a stop sign or a no-entry sign. If you want to bypass it, you can. That's why more restrictive measures (such as aggressive IP filtering) are often deployed against crawlers that do not comply with robots.txt.

Your 'good intentions' do not matter; you should have respected robots.txt in the first place.
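To illustrate what complying means in practice, here is a minimal sketch of a robots.txt Disallow check in plain Java. This only handles simple prefix rules in the `User-agent: *` group; a real parser (e.g. crawler4j's RobotstxtServer) also handles wildcards like `/info/*.html`, Allow rules, and per-agent groups:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Collect the Disallow path prefixes from the "User-agent: *" group.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = l.substring(11).trim().equals("*");
            } else if (inStarGroup && l.toLowerCase().startsWith("disallow:")) {
                rules.add(l.substring(9).trim());
            }
        }
        return rules;
    }

    // A path is off-limits if it starts with any disallowed prefix.
    static boolean isDisallowed(String path, List<String> rules) {
        for (String rule : rules) {
            if (!rule.isEmpty() && path.startsWith(rule)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /pgol/\nDisallow: /pg/cgi/";
        List<String> rules = disallowedForAll(robots);
        System.out.println(isDisallowed("/pgol/4-Benzinaio/3-Roma/", rules)); // true
        System.out.println(isDisallowed("/info.html", rules));               // false
    }
}
```

Run against the robots.txt quoted in the question, this reports the /pgol/ pages the crawler was fetching as disallowed.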

Jérôme Verstrynge