Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the pages of the web by following the URLs linked from each page it visits. The crawler is usually given an initial seed of URLs from which to start its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler visits them.

More on Wikipedia
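
As a concrete illustration of the seed/frontier loop described above, here is a minimal sketch. The use of the third-party requests library, the naive href regex and the fixed politeness delay are illustrative assumptions, not part of the definition.

    import re
    import time
    from collections import deque
    from urllib.parse import urljoin

    import requests

    def crawl(seeds, max_pages=100, delay=1.0):
        frontier = deque(seeds)      # the "crawl frontier"
        seen = set(seeds)
        while frontier and max_pages > 0:
            url = frontier.popleft()
            max_pages -= 1
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            # Identify hyperlinks in the page and add unseen ones to the frontier.
            for href in re.findall(r'href="([^"]+)"', resp.text):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
            time.sleep(delay)        # crude politeness: pause between requests
        return seen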

98 questions
1 vote, 0 answers

What are the symptoms of an overloaded webserver

I'm maintaining some web crawlers. I want to improve our load/throttling system to be more intelligent. Of course I look at response codes and throttle up or down based on that. I would, though, like the system to be better at dynamically adjusting…
Niels Kristian (358)
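
The kind of dynamic adjustment this question is after can be sketched as a simple feedback loop: back off sharply on signs of overload (429/5xx responses, rising latency) and recover gradually otherwise. The status codes, latency threshold and delay bounds below are illustrative assumptions, not values from the question.

    class AdaptiveThrottle:
        def __init__(self, min_delay=0.5, max_delay=60.0):
            self.delay = min_delay          # seconds to wait between requests
            self.min_delay = min_delay
            self.max_delay = max_delay

        def record(self, status_code, response_seconds):
            overloaded = status_code in (429, 500, 502, 503) or response_seconds > 5.0
            if overloaded:
                self.delay = min(self.delay * 2, self.max_delay)    # multiplicative back-off
            else:
                self.delay = max(self.delay - 0.1, self.min_delay)  # gradual recovery
            return self.delay
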
1 vote, 2 answers

Protection against scraping with nginx

This morning we had a crawler going nuts on our server, hitting our site almost 100 times per second. We'd like to add some protection against this. I guess I'll have to use HttpLimitReqModule, but I don't want to block google/bing/... How should I do…
bl0b (141)
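
A minimal sketch of the limit_req (HttpLimitReqModule) approach for this situation, keyed so that requests whose User-Agent claims to be a major search engine are not rate-limited. The zone name, rate and bot list are assumptions, and User-Agent matching can be spoofed; verifying real Googlebot/Bingbot traffic would need a reverse-DNS check.

    # In the http {} block: choose the rate-limit key per request.
    map $http_user_agent $limit_key {
        default                       $binary_remote_addr;  # normal clients: limited per IP
        "~*(googlebot|bingbot|slurp)" "";                    # empty key = exempt from limiting
    }

    limit_req_zone $limit_key zone=crawlers:10m rate=5r/s;

    server {
        location / {
            limit_req zone=crawlers burst=10 nodelay;
            # ... usual proxy/fastcgi configuration
        }
    }
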
1 vote, 1 answer

How to Block Web Crawler from Downloading File

Is it possible to block web crawlers from downloading files (like zip files) on my server? I was planning to create a PHP script that uses cookies to track visitors, especially web crawlers, and require them to log in/register after downloading 3 files. But I found out that web…
jaYPabs (299)
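
If the goal is simply to keep self-identified crawlers out of a download area, rather than counting downloads per visitor as the PHP/cookie idea would, a blunt nginx rule is one alternative. The path and user-agent list are illustrative, and anything that spoofs its User-Agent still gets through.

    # Deny requests to the downloads directory when the client identifies as a bot.
    location /downloads/ {
        if ($http_user_agent ~* "(bot|crawl|spider|wget|curl)") {
            return 403;
        }
        # ... normal file serving
    }
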
1 vote, 0 answers

How to Exclude Log Using Fail2Ban logpath with Wildcard Settings

I'm using a wildcard in the logpath value as shown below:

    [http-get-dos]
    enabled  = true
    filter   = http-get-dos
    logpath  = /var/log/ispconfig/httpd/*/access.log
    maxretry = 250
    findtime = 300
    #ban for 10 hours
    bantime  = 36000
    action =…
jaYPabs (299)
1 vote, 1 answer

How to ban web crawler using fail2ban

I am using nginx and, if I am correct, I am constantly being hit by a web crawler. I tried to configure fail2ban, but the IP address is not detected by fail2ban. The reason it is not detected seems to be that it looks like a legitimate visitor. Here's the…
jaYPabs (299)
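
One way fail2ban can catch a crawler that otherwise looks like a legitimate visitor is a filter that matches its User-Agent in the nginx access log. The filter name, bot names and thresholds below are illustrative, and this only works for bots that identify themselves honestly.

    # /etc/fail2ban/filter.d/nginx-badbots.conf (illustrative name)
    [Definition]
    badbots     = AhrefsBot|MJ12bot|SemrushBot|Baiduspider
    failregex   = ^<HOST> .* "(?:GET|POST) [^"]*" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s)[^"]*"$
    ignoreregex =

    # /etc/fail2ban/jail.local
    [nginx-badbots]
    enabled  = true
    filter   = nginx-badbots
    logpath  = /var/log/nginx/access.log
    maxretry = 2
    bantime  = 86400
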
1 vote, 1 answer

Googlebot repeatedly looks for files that aren't on my server

I'm hosting a site for a volunteer organization. I've moved the site to WordPress, but it wasn't always that way. I suspect at one point it was hacked badly. My Apache error log file has grown to 122 kB in just the past 18 hours. The large…
John (167)
1 vote, 2 answers

How to create a global robots.txt that gets appended to each domain's own robots.txt on Apache?

I know I can create ONE robots.txt file for all domains on an Apache server*, but I want to append to each domain's own robots.txt (if one pre-exists). I want some general rules in place for all domains, but I need to allow different domains to have their…
Gaia (1,855)
1 vote, 3 answers

How to block this URL pattern in Varnish VCL?

My website is getting badly hit by spambots and scrapers. I've used Cloudflare, but the problem still remains. The problem is spambots accessing non-existent URLs, causing a lot of load on my Drupal backend, which goes all the way and…
iTech (355)
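
A sketch of what such a rule looks like in Varnish (VCL 4.0 syntax): match the abusive URL pattern in vcl_recv and answer it from Varnish so it never reaches the Drupal backend. The example patterns are assumptions; substitute the URLs the spambots actually request.

    sub vcl_recv {
        # Requests matching the bad pattern are answered with a synthetic 403.
        if (req.url ~ "^/(user/register|node/add|.*\?q=comment/reply)") {
            return (synth(403, "Forbidden"));
        }
    }
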
1 vote, 3 answers

Blocking 'good' bots in nginx with multiple conditions for certain off-limits URLs where humans can go

After 2 days of searching/trying/failing I decided to post this here; I haven't found any example of someone doing the same, nor does what I tried seem to be working. I'm trying to send a 403 to bots not respecting the robots.txt file (even after…
Glenn Plas (221)
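
Because nginx's if cannot combine several conditions directly, the usual pattern is to fold the checks into a variable with map and test that variable inside the restricted locations. The paths and the user-agent expression below are assumptions for illustration.

    # In the http {} block: flag self-identified bots.
    map $http_user_agent $is_bot {
        default                      0;
        "~*(bot|crawl|spider|slurp)" 1;
    }

    server {
        # URLs disallowed in robots.txt but fine for humans.
        location ~ ^/(search|checkout|cart) {
            if ($is_bot) {
                return 403;
            }
            # ... normal handling
        }
    }
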
1 vote, 1 answer

Does a forward web proxy exist that checks and obeys robots.txt on remote domains?

Does there exist a forward proxy server that will look up and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy? e.g. Imagine a website at www.example.com that has a robots.txt file that…
wodow (590)
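
Whatever the proxy software, the check it would have to perform per request is the one exposed by Python's standard-library robots.txt parser, shown standalone below (not as an actual proxy); the crawler name is a placeholder.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # fetch and parse the remote robots.txt
    # True only if this user agent is allowed to fetch the URL.
    print(rp.can_fetch("MyProxyBot/1.0", "http://www.example.com/some/page.html"))
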
1 vote, 1 answer

How big would a MySQL database be if I save all webpages' title and URL in it?

For learning purposes, I want to make a simple web indexer which crawls the web and saves all found pages in a MySQL database with their titles and URLs, with this table (the page's content is not saved): id: integer AUTO_INCREMENT PRI title:…
user42235
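
As a rough, assumption-laden estimate (none of the averages are from the question): if an average URL is about 80 bytes and an average title about 60 bytes, each row costs on the order of 150-250 bytes once the integer id, row overhead and an index on the URL column are included. One million pages would then be roughly 150-250 MB, and a billion pages roughly 150-250 GB, so the table is small compared with storing page content.
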
1 vote, 3 answers

Should I ban spiders?

A Rails template script that I've been looking at automatically adds User-Agent: and Disallow: lines to robots.txt, thereby banning all spiders from the site. What are the benefits of banning spiders, and why would you want to?
marflar (397)
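
For reference, a robots.txt that bans every compliant spider is just these two lines (presumably what the template generates):

    User-agent: *
    Disallow: /
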
1 vote, 1 answer

Yahoo AdCrawler hammering our site

Yahoo AdCrawler is re-trying some URLs repeatedly. The URLs are being given a 302 response code, so I suppose Yahoo should come back and try again "later", but "later" in my book doesn't mean that 7 specific URLs should be hit 3,000 times a day…
Kristen (187)
1 vote, 2 answers

Copy a website and preserve the file & folder structure

I have an old web site running on an ancient version of Oracle Portal that we need to convert to a flat-html structure. Due to damage to the server we are not able to access the administrative interface, and even if we could there is no export…
DrStalker (6,946)
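
When the backend is unusable, mirroring the rendered pages from the outside is the usual fallback. A wget invocation along these lines preserves the directory structure and rewrites links and extensions for flat hosting; the hostname is a placeholder.

    # Mirror the site, keeping its folder structure and fixing links/extensions.
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --no-parent http://old-portal.example.com/
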
1 vote, 3 answers

Is it worthwhile to block malicious crawlers via iptables?

I periodically check my server logs and I notice a lot of crawlers searching for the location of phpmyadmin, zencart, roundcube, administrator sections and other sensitive data. Then there are also crawlers under the name "Morfeus Fucking Scanner" or…
aardbol (1,473)
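
If you do decide to block at the packet filter, an ipset-backed rule keeps the iptables rule count constant no matter how many scanner addresses you collect. The set name and the example address (from the documentation range) are placeholders.

    # Create a set of bad addresses and drop anything coming from them.
    ipset create badcrawlers hash:ip
    ipset add badcrawlers 203.0.113.45
    iptables -I INPUT -m set --match-set badcrawlers src -j DROP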