
Is there a way to block offline browsers (like Teleport Pro, Webzip, etc.) that show up in the logs as "Mozilla"?

Example: Webzip shows up in my site logs as "Mozilla/4.0 (compatible; MSIE 8.0; Win32)"

Teleport Pro shows up in my site logs as "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT)"

I did some tests using the .htaccess file, but they all ended up blocking my own browsers (Mozilla and Chrome) as well, and of course I don't want to block normal visitors, just the bandwidth leechers (which also eat a lot of CPU/RAM with their requests). On top of that, these offline browsers seem to ignore the robots.txt file. Does anyone know a way to identify and block them? If possible, please give examples.

Alex

3 Answers


Short Answer: No

Long Answer:...

Most "Offline Browsers"/Scrapers just download the raw HTML/JS/CSS to be processed by the browser later. These, if their User-Agent Strings look like Legit "Online Browsers" that's all you have to go by and thus can't block them.

If they were to execute JavaScript during the scrape (useful for sites that use JavaScript to load parts of the page, etc.), you could probe the JavaScript API to see which features are present and target them that way. However, this is pretty much pointless, as they are likely to use an engine such as WebKit that legitimate browsers also use.

Some scrapers may abide by the robots.txt file, but those are more likely to be crawlers like Google Search/Cache rather than "offline browsers".
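For reference, a robots.txt that asks clients to stay out of a download area looks like the following (the /downloads/ path is just an example); well-behaved crawlers honour it, but as noted above, offline browsers generally ignore it:

    User-agent: *
    Disallow: /downloads/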

The last method is to put the downloads behind authentication. This is effective as long as the person running the offline scraper doesn't hand it an authenticated session.
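A minimal sketch of that approach with HTTP Basic authentication in .htaccess (the paths, realm name and username are placeholders; the password file is created with the htpasswd utility, e.g. htpasswd -c /home/example/.htpasswd someuser):

    # Place this .htaccess in the directory that holds the downloads.
    AuthType Basic
    AuthName "Downloads"
    AuthUserFile /home/example/.htpasswd
    Require valid-user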

Hope that helps :)

Ben Evans

I don't really have a good answer, just some ideas. But it's an interesting question. I don't think the answer is simple, unless someone else has put a ton of work into writing a program to do it. If the robots don't want to tell you they're robots, they don't have to. You would have to use some kind of trick to work out whether they are.

Maybe you could put an invisible link at the top of the page, one a human wouldn't be able to follow, and then block anyone who does follow it.

By invisible, I mean putting it inside an HTML comment. I don't know enough about offline browsers to know whether they are smart enough not to follow links inside HTML comments.
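A rough sketch of how that trap could be wired up with mod_rewrite in .htaccess, assuming the page contains a hidden link to a made-up URL such as /trap-do-not-follow.html (inside an HTML comment or hidden with CSS). Any client that requests the trap URL gets tagged with a marker cookie and is refused afterwards; note this only works against clients that send cookies back, so it's an idea rather than a guarantee:

    RewriteEngine On

    # Anyone fetching the hidden trap URL gets a marker cookie on the response
    # (URL, cookie name, domain and lifetime are made-up example values).
    RewriteRule ^trap-do-not-follow\.html$ - [CO=trapped:1:.example.com:525600:/]

    # Refuse every later request that presents the marker cookie.
    RewriteCond %{HTTP_COOKIE} trapped=1
    RewriteRule ^ - [F]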

Anyone who follows a new link exactly every x seconds is also a robot. Block them.

Stuff like that.
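Detecting that kind of fixed-interval fetching really needs log analysis or an external tool such as fail2ban rather than .htaccess alone. If the main worry is the bandwidth and CPU such clients burn, a different but related mitigation is Apache 2.4's mod_ratelimit, which throttles each connection; a sketch for the download directory's .htaccess, assuming the module is enabled (400 KiB/s is just an example figure):

    <IfModule mod_ratelimit.c>
        # Throttle each connection to roughly 400 KiB/s (example value).
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 400
    </IfModule>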

Buttle Butkus
  • You have the privilege to edit other people's posts; it's much more constructive to do so than to make fun of them. – user9517 Jan 03 '13 at 10:03
  • @lain so now you're following me around, huh? I thought what he said was funny, that's all. I wasn't making fun of anyone. Honestly, I thought "bandwitch" might be like "automagically". – Buttle Butkus Jan 03 '13 at 10:09

If you need to protect your large downloads, the best way to handle that is to put them behind a login. As you found out, blocking the user agent via .htaccess or robots.txt runs the risk of blocking legitimate traffic.

ceskib