
Possible Duplicate:
Ethics of Robots.txt

I am trying out Mechanize to automate some work on a site. I have managed to get past the error above by using br.set_handle_robots(False). How ethical is it to do that?
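Roughly, this is the setup I have at the moment (the URL below is just a placeholder for the real site):

```python
import mechanize

br = mechanize.Browser()
# Without this line, mechanize checks the site's robots.txt and raises
# the error I mentioned when the requested page is disallowed.
br.set_handle_robots(False)

response = br.open("http://example.com/some/page")  # placeholder URL
print(response.read()[:200])
```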

If it isn't, then I thought of obeying robots.txt, but the site I am trying to mechanize blocks me from viewing its robots.txt. Does this mean no bots are allowed on it? What should my next steps be?

Thanks in advance.

1 Answer

For your first question, see Ethics of robots.txt.

You need to keep in mind the purpose of robots.txt. Robots that crawl a site can potentially wreak havoc on it and essentially cause a DoS attack. So if your "automation" is crawling at all, or is downloading more than just a few pages every day or so, AND the site has a robots.txt file that excludes you, then you should honor it.
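If you do decide to honor the exclusions yourself rather than bypassing them, the standard library's robot parser can check a URL before you fetch it. A minimal sketch (the site URL and the user-agent token are placeholders):

```python
from urllib import robotparser  # on Python 2 the module is just `robotparser`

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # placeholder site
rp.read()

# Only fetch the page if the rules allow our (made-up) user-agent token.
if rp.can_fetch("my-script", "http://example.com/some/page"):
    print("Allowed - go ahead and fetch it")
else:
    print("Disallowed - honor the exclusion")
```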

Personally, I find this a little grey. If my script works at the same pace as a human using a browser and only grabs a few pages, then, in the spirit of the robots exclusion standard, I have no problem scraping those pages, so long as it doesn't access the site more than once a day. Please read that last sentence carefully before judging me. I feel it is perfectly logical. Many people may disagree with me there, though.

For your second question, web servers have the ability to return a 403 based on the User-Agent attribute of the HTTP header sent with your request. In order to have your script mimic a browser, you have to misrepresent yourself. Meaning, you need to change the HTTP User-Agent header to match the one used by a mainstream web browser (e.g., Firefox, IE, Chrome). Right now it probably says something like 'Mechanize'.
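With mechanize this is usually done through the browser's addheaders list. Something along these lines (the user-agent string below is only an example; copy a current one from your own browser):

```python
import mechanize

br = mechanize.Browser()

# Replace mechanize's default User-Agent with one a mainstream browser sends.
br.addheaders = [("User-Agent",
                  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36")]

response = br.open("http://example.com/")  # placeholder URL
```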

Some sites are more sophisticated than that and have other methods for detecting non-human visitors. In that case, give up because they really don't want you accessing the site in that manner.
