Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the pages of the web by following the URLs linked from each page it visits. The crawler is usually given an initial seed of URLs from which to start its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler visits them.

More on Wikipedia
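
As a concrete illustration of the seed/frontier loop described above, here is a minimal sketch. The use of the third-party requests library, the naive href regex and the fixed politeness delay are illustrative assumptions, not part of the definition.

    import re
    import time
    from collections import deque
    from urllib.parse import urljoin

    import requests

    def crawl(seeds, max_pages=100, delay=1.0):
        frontier = deque(seeds)      # the "crawl frontier"
        seen = set(seeds)
        while frontier and max_pages > 0:
            url = frontier.popleft()
            max_pages -= 1
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            # Identify hyperlinks in the page and add unseen ones to the frontier.
            for href in re.findall(r'href="([^"]+)"', resp.text):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
            time.sleep(delay)        # crude politeness: pause between requests
        return seen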

98 questions
1 vote, 0 answers

What are the symptoms of an overloaded webserver

I'm maintaining some web crawlers. I want to improve our load/throttling system to be more intelligent. Of course I look at response codes and throttle up or down based on that. I would, though, like the system to be better at dynamically adjusting…
Niels Kristian (358)
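
The kind of dynamic adjustment this question is after can be sketched as a simple feedback loop: back off sharply on signs of overload (429/5xx responses, rising latency) and recover gradually otherwise. The status codes, latency threshold and delay bounds below are illustrative assumptions, not values from the question.

    class AdaptiveThrottle:
        def __init__(self, min_delay=0.5, max_delay=60.0):
            self.delay = min_delay          # seconds to wait between requests
            self.min_delay = min_delay
            self.max_delay = max_delay

        def record(self, status_code, response_seconds):
            overloaded = status_code in (429, 500, 502, 503) or response_seconds > 5.0
            if overloaded:
                self.delay = min(self.delay * 2, self.max_delay)    # multiplicative back-off
            else:
                self.delay = max(self.delay - 0.1, self.min_delay)  # gradual recovery
            return self.delay
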
1 vote, 2 answers

Protection against scraping with nginx

This morning we had a crawler going nuts on our server, hitting our site almost 100 times per second. We'd like to add some protection against this. I guess I'll have to use HttpLimitReqModule, but I don't want to block google/bing/... How should I do…
bl0b (141)
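
A minimal sketch of the limit_req (HttpLimitReqModule) approach for this situation, keyed so that requests whose User-Agent claims to be a major search engine are not rate-limited. The zone name, rate and bot list are assumptions, and User-Agent matching can be spoofed; verifying real Googlebot/Bingbot traffic would need a reverse-DNS check.

    # In the http {} block: choose the rate-limit key per request.
    map $http_user_agent $limit_key {
        default                       $binary_remote_addr;  # normal clients: limited per IP
        "~*(googlebot|bingbot|slurp)" "";                    # empty key = exempt from limiting
    }

    limit_req_zone $limit_key zone=crawlers:10m rate=5r/s;

    server {
        location / {
            limit_req zone=crawlers burst=10 nodelay;
            # ... usual proxy/fastcgi configuration
        }
    }
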
1 vote, 1 answer

How to Block Web Crawler from Downloading File

Is it possible to block web crawlers from downloading files (like zip files) on my server? I was planning to create a PHP script that uses cookies to track visitors, especially web crawlers, and require them to log in/register after downloading 3 files. But I found out that web…
jaYPabs (299)
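
If the goal is simply to keep self-identified crawlers out of a download area, rather than counting downloads per visitor as the PHP/cookie idea would, a blunt nginx rule is one alternative. The path and user-agent list are illustrative, and anything that spoofs its User-Agent still gets through.

    # Deny requests to the downloads directory when the client identifies as a bot.
    location /downloads/ {
        if ($http_user_agent ~* "(bot|crawl|spider|wget|curl)") {
            return 403;
        }
        # ... normal file serving
    }
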
1 vote, 0 answers

How to Exclude Log Using Fail2Ban logpath with Wildcard Settings

I'm using a wildcard in the logpath value as shown below:

    [http-get-dos]
    enabled  = true
    filter   = http-get-dos
    logpath  = /var/log/ispconfig/httpd/*/access.log
    maxretry = 250
    findtime = 300
    #ban for 10 hours
    bantime  = 36000
    action =…
jaYPabs (299)
1 vote, 1 answer

How to ban web crawler using fail2ban

I am using nginx and, if I am correct, I am constantly being hit by a web crawler. I tried to configure fail2ban, but the IP address is not detected by fail2ban. The reason it is not detected seems to be that it looks like a legitimate visitor. Here's the…
jaYPabs (299)
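
One way fail2ban can catch a crawler that otherwise looks like a legitimate visitor is a filter that matches its User-Agent in the nginx access log. The filter name, bot names and thresholds below are illustrative, and this only works for bots that identify themselves honestly.

    # /etc/fail2ban/filter.d/nginx-badbots.conf (illustrative name)
    [Definition]
    badbots     = AhrefsBot|MJ12bot|SemrushBot|Baiduspider
    failregex   = ^<HOST> .* "(?:GET|POST) [^"]*" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s)[^"]*"$
    ignoreregex =

    # /etc/fail2ban/jail.local
    [nginx-badbots]
    enabled  = true
    filter   = nginx-badbots
    logpath  = /var/log/nginx/access.log
    maxretry = 2
    bantime  = 86400
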
1 vote, 1 answer

Googlebot repeatedly looks for files that aren't on my server

I'm hosting a site for a volunteer organization. I've moved the site to WordPress, but it wasn't always that way. I suspect at one point it was hacked badly. My Apache error log file has grown to 122 kB in just the past 18 hours. The large…
John (167)
1 vote, 2 answers

How to create a global robots.txt that gets appended to each domain's own robots.txt on Apache?

I know I can create ONE robots.txt file for all domains on an Apache server*, but I want to append to each domain's own robots.txt (if one pre-exists). I want some general rules in place for all domains, but I need to allow different domains to have their…
Gaia (1,855)
1 vote, 3 answers

How to block this URL pattern in Varnish VCL?

My website is getting badly hit by spambots and scrapers. I've used Cloudflare, but the problem still remains. The problem is spambots accessing non-existent URLs, causing a lot of load on my Drupal backend, which goes all the way and…
iTech (355)
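
A sketch of what such a rule looks like in Varnish (VCL 4.0 syntax): match the abusive URL pattern in vcl_recv and answer it from Varnish so it never reaches the Drupal backend. The example patterns are assumptions; substitute the URLs the spambots actually request.

    sub vcl_recv {
        # Requests matching the bad pattern are answered with a synthetic 403.
        if (req.url ~ "^/(user/register|node/add|.*\?q=comment/reply)") {
            return (synth(403, "Forbidden"));
        }
    }
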
1 vote, 3 answers

Blocking 'good' bots in nginx with multiple conditions for certain off-limits URLs where humans can go

After 2 days of searching/trying/failing I decided to post this here; I haven't found any example of someone doing the same, nor does what I tried seem to be working. I'm trying to send a 403 to bots not respecting the robots.txt file (even after…
Glenn Plas (221)
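
Because nginx's if cannot combine several conditions directly, the usual pattern is to fold the checks into a variable with map and test that variable inside the restricted locations. The paths and the user-agent expression below are assumptions for illustration.

    # In the http {} block: flag self-identified bots.
    map $http_user_agent $is_bot {
        default                      0;
        "~*(bot|crawl|spider|slurp)" 1;
    }

    server {
        # URLs disallowed in robots.txt but fine for humans.
        location ~ ^/(search|checkout|cart) {
            if ($is_bot) {
                return 403;
            }
            # ... normal handling
        }
    }
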
1 vote, 1 answer

Does a forward web proxy exist that checks and obeys robots.txt on remote domains?

Does there exist a forward proxy server that will look up and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy? e.g. Imagine a website at www.example.com that has a robots.txt file that…
wodow (590)
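
Whatever the proxy software, the check it would have to perform per request is the one exposed by Python's standard-library robots.txt parser, shown standalone below (not as an actual proxy); the crawler name is a placeholder.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # fetch and parse the remote robots.txt
    # True only if this user agent is allowed to fetch the URL.
    print(rp.can_fetch("MyProxyBot/1.0", "http://www.example.com/some/page.html"))
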
1 vote, 1 answer

How big would a MySQL database be if I save all webpages' title and URL in it?

For learning purposes, I want to make a simple web indexer which crawls the web and saves all found pages in a MySQL database with their titles and URLs, with this table (the page's content is not saved): id: integer AUTO_INCREMENT PRI title:…
user42235
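
As a rough, assumption-laden estimate (none of the averages are from the question): if an average URL is about 80 bytes and an average title about 60 bytes, each row costs on the order of 150-250 bytes once the integer id, row overhead and an index on the URL column are included. One million pages would then be roughly 150-250 MB, and a billion pages roughly 150-250 GB, so the table is small compared with storing page content.
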
1 vote, 3 answers

Should I ban spiders?

A Rails template script that I've been looking at automatically adds User-Agent: and Disallow: lines to robots.txt, thereby banning all spiders from the site. What are the benefits of banning spiders, and why would you want to?
marflar (397)
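
For reference, a robots.txt that bans every compliant spider is just these two lines (presumably what the template generates):

    User-agent: *
    Disallow: /
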
1 vote, 1 answer

Yahoo AdCrawler hammering our site

Yahoo AdCrawler is re-trying some URLs repeatedly. The URLs are being given a 302 response code, so I suppose Yahoo should come back and try again "later", but "later" in my book doesn't mean that 7 specific URLs should be hit 3,000 times a day…
Kristen (187)
1 vote, 2 answers

Copy a website and preserve the file & folder structure

I have an old web site running on an ancient version of Oracle Portal that we need to convert to a flat-html structure. Due to damage to the server we are not able to access the administrative interface, and even if we could there is no export…
DrStalker (6,946)
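
When the backend is unusable, mirroring the rendered pages from the outside is the usual fallback. A wget invocation along these lines preserves the directory structure and rewrites links and extensions for flat hosting; the hostname is a placeholder.

    # Mirror the site, keeping its folder structure and fixing links/extensions.
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --no-parent http://old-portal.example.com/
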
1 vote, 3 answers

Is it worthwhile to block malicious crawlers via iptables?

I periodically check my server logs and I notice a lot of crawlers searching for the location of phpmyadmin, zencart, roundcube, administrator sections and other sensitive data. Then there are also crawlers under the name "Morfeus Fucking Scanner" or…
aardbol (1,473)
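
If you do decide to block at the packet filter, an ipset-backed rule keeps the iptables rule count constant no matter how many scanner addresses you collect. The set name and the example address (from the documentation range) are placeholders.

    # Create a set of bad addresses and drop anything coming from them.
    ipset create badcrawlers hash:ip
    ipset add badcrawlers 203.0.113.45
    iptables -I INPUT -m set --match-set badcrawlers src -j DROP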