
Is it possible to block web crawlers from downloading files (like zip files) on my server?

I planned to create a PHP script that uses cookies to track visitors, especially web crawlers, and force them to log in/register after downloading 3 files. But I found out that web crawlers can bypass cookies.

Is it possible to block web crawlers? Or is there any other option that will hide the files from a web crawler after it downloads up to 3 files?

I can easily create a PHP script using cookies to force visitors to log in/register, but what about web crawlers?
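For context, the cookie-based approach I have in mind looks roughly like this minimal sketch (the cookie name `dl_count`, the 3-file limit, `/login.php`, and the file directory are all placeholders):

```php
<?php
// download.php - minimal sketch of the cookie-based limit described above.
// The cookie name, limit, login URL, and file directory are all placeholders.

$limit = 3;
$count = isset($_COOKIE['dl_count']) ? (int) $_COOKIE['dl_count'] : 0;

if ($count >= $limit) {
    // Over the free limit: send the visitor to login/registration.
    header('Location: /login.php');
    exit;
}

// Remember the new count for 30 days (must be set before any output).
setcookie('dl_count', (string) ($count + 1), time() + 30 * 24 * 3600, '/');

// Serve the requested file; basename() strips directory components
// as a basic guard against path traversal.
$file = basename($_GET['file'] ?? '');
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="' . $file . '"');
readfile('/var/www/files/' . $file);
```

The problem is that a crawler which simply never sends the cookie back starts from a count of 0 on every request.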

By the way, I'm using nginx and the Drupal CMS, in case that information helps.

jaYPabs
  • Did you know that normal browsers can refuse cookies too? Tracking people who don't want to be tracked is not a trivial problem. You can solve the web crawler problem by using [a `robots.txt` file](http://www.robotstxt.org/) (see the example after these comments). – Ladadadada Jul 27 '13 at 14:51
  • I'm sure you know that bad web crawlers don't follow what robots.txt says. – jaYPabs Jul 27 '13 at 14:53
  • Yes, you can only stop *good* crawlers with a `robots.txt` file. Techniques to identify the bad ones would fill a book. – Ladadadada Jul 27 '13 at 15:02
  • I'm thinking of using PHP only, without cookies, by recording the number of visits. But I don't know if this is a good idea, since it will add load to the server. What do you think? – jaYPabs Jul 27 '13 at 15:05
  • The important question is: does it really hurt you if the crawler downloads the files? – Christopher Perrin Jul 27 '13 at 18:32
  • @ChristopherPerrin Yes, of course. Think about the bandwidth consumed by a web crawler. – jaYPabs Jul 29 '13 at 12:56
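On the `robots.txt` point from the comments: well-behaved crawlers honor exclusion rules, so a file like this (assuming, hypothetically, that the zip files live under `/files/`) stops them from ever requesting the downloads:

```
User-agent: *
Disallow: /files/
```

As the comments note, this only stops crawlers that choose to obey it; badly behaved ones ignore `robots.txt` entirely.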

1 Answer


If you've designed your site properly, there is no difference between the security you need for a regular client and for a crawler. Since you said you're relying on cookies to track this, a malicious client can easily bypass your "security". It sounds like you are only handling the case where the client is well behaved. That's fine for some sites (hell, the NYTimes does it). It's up to you to decide whether you need the additional security, which adds complexity, or whether you're fine without it.

Crawlers don't necessarily send cookies back, but then again, neither do normal web browsers when the user refuses them. About the only feasible option here is to track downloads by IP address, though that becomes much less useful with IPv6, where a single user can draw on a huge range of addresses.
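A minimal sketch of that IP-based idea, assuming PHP with the PDO SQLite driver (the database path, table name, limit, and login URL are placeholders):

```php
<?php
// Minimal sketch: count downloads per client IP instead of per cookie.
// Assumes the PDO SQLite driver; db path, table, and limit are placeholders.

$limit = 3;
$ip    = $_SERVER['REMOTE_ADDR'];

$db = new PDO('sqlite:/var/data/downloads.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS downloads (ip TEXT PRIMARY KEY, count INTEGER NOT NULL)');

// Look up how many files this address has already taken.
$stmt = $db->prepare('SELECT count FROM downloads WHERE ip = ?');
$stmt->execute([$ip]);
$count = (int) $stmt->fetchColumn();   // false -> 0 when no row exists yet

if ($count >= $limit) {
    header('Location: /login.php');    // over the limit: require login/registration
    exit;
}

// Record the download, then serve the file as in the cookie version.
$db->prepare('INSERT OR REPLACE INTO downloads (ip, count) VALUES (?, ?)')
   ->execute([$ip, $count + 1]);
```

Keep the caveats in mind: many users can share one IPv4 address behind NAT, and a single IPv6 user can rotate through many addresses, so treat this as a soft limit rather than real security.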

devicenull