Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
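
A minimal sketch of that seed/frontier loop in Python, assuming the third-party requests and BeautifulSoup libraries (neither is implied by the tag wiki itself):

```python
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100, delay=1.0):
    """Breadth-first crawl: seeds feed a frontier; each visited page adds new links."""
    frontier = deque(seeds)              # the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                     # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in visited:
                frontier.append(absolute)    # grow the frontier
        time.sleep(delay)                # crude politeness policy
    return visited
```

Real crawlers layer policies on top of this loop: which links to follow, how often to revisit, and how to stay polite per host.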

The Web's large volume implies that a crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The Web's high rate of change implies that pages may already have been updated or even deleted by the time the crawler gets to them.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
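
A common mitigation is to canonicalize URLs before adding them to the frontier, so presentation-only parameters collapse into a single fingerprint. A sketch in Python, with the parameter names purely illustrative:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Parameters assumed to change only presentation, not content (illustrative names).
IGNORED_PARAMS = {"sort", "thumb_size", "format", "show_user_content"}

def canonicalize(url):
    """Drop presentation-only parameters and sort the rest, so the 48 gallery
    variants described above collapse to one canonical URL."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    query.sort()
    return urlunparse(parts._replace(query=urlencode(query)))
```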

9683 questions
62 votes, 4 answers

Python: Disable images in Selenium Google ChromeDriver

I spend a lot of time searching about this. At the end of the day I combined a number of answers and it works. I share my answer and I'll appreciate it if anyone edits it or provides us with an easier way to do this. 1- The answer in Disable images…
1man • 5,216 • 7 • 42 • 56
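
The approach that usually answers this question is to pass Chrome a preference that blocks image loading. A sketch, assuming a reasonably recent Selenium and ChromeDriver:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# 2 = block images; the preference key is Chrome's own, not Selenium's.
options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
```
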
58 votes, 9 answers

How do you archive an entire website for offline viewing?

We actually have burned static/archived copies of our asp.net websites for customers many times. We have used WebZip until now but we have had endless problems with crashes, downloaded pages not being re-linked correctly, etc. We basically need an…
jskunkle • 1,271 • 3 • 13 • 24
48 votes, 3 answers

Node.JS: How to pass variables to asynchronous callbacks?

I'm sure my problem is based on a lack of understanding of asynch programming in node.js but here goes. For example: I have a list of links I want to crawl. When each asynch request returns I want to know which URL it is for. But, presumably because…
Marc • 13,011 • 11 • 78 • 98
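
The underlying problem is binding each URL to the callback that handles its response. The question is about Node.js; for consistency with the other sketches here, this shows the analogous Python pattern of keeping a mapping from each pending future back to its URL:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical links

with ThreadPoolExecutor(max_workers=5) as pool:
    # Each future remembers which URL it was submitted for.
    future_to_url = {pool.submit(requests.get, u, timeout=10): u for u in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            print(url, future.result().status_code)
        except requests.RequestException as exc:
            print(url, "failed:", exc)
```

In callback-style JavaScript the same effect comes from a closure that captures the URL at the time the request is created.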
46 votes, 7 answers

Detecting honest web crawlers

I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against…
JavadocMD • 4,397 • 2 • 25 • 23
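
Well-behaved crawlers identify themselves in the User-Agent header, so a simple server-side check is usually a substring or regex match. A sketch with an illustrative, non-exhaustive list of bot tokens:

```python
import re

# Tokens used by common self-identifying crawlers; extend as needed.
BOT_PATTERN = re.compile(
    r"googlebot|bingbot|slurp|duckduckbot|baiduspider|yandexbot|crawler|spider",
    re.IGNORECASE,
)

def is_honest_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string looks like a self-identifying crawler."""
    return bool(user_agent and BOT_PATTERN.search(user_agent))
```

Since the header is trivially spoofed, the usual stronger check for crawlers that matter (e.g. Googlebot) is a reverse-DNS lookup of the requesting IP.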
45 votes, 6 answers

How to programmatically fill input elements built with React?

I'm tasked with crawling website built with React. I'm trying to fill in input fields and submitting the form using javascript injects to the page (either selenium or webview in mobile). This works like a charm on every other site + technology but…
Timo Kauranen • 453 • 1 • 4 • 5
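
React-controlled inputs ignore a plain value assignment because React tracks the value through its own setter; the widely used workaround is to call the native value setter and then dispatch an input event. A sketch driven from Python Selenium (the page URL and field locator are hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/react-form")        # hypothetical page

element = driver.find_element(By.NAME, "email")      # hypothetical field
driver.execute_script(
    """
    const input = arguments[0], value = arguments[1];
    // Bypass React's value tracking by using the native setter...
    const setter = Object.getOwnPropertyDescriptor(
        window.HTMLInputElement.prototype, 'value').set;
    setter.call(input, value);
    // ...then tell React that something changed.
    input.dispatchEvent(new Event('input', { bubbles: true }));
    """,
    element,
    "user@example.com",
)
```
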
44 votes, 9 answers

Automated link-checker for system testing

I often have to work with fragile legacy websites that break in unexpected ways when logic or configuration are updated. I don't have the time or knowledge of the system needed to create a Selenium script. Besides, I don't want to check a specific…
ctford • 7,189 • 4 • 34 • 51
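
A small script that checks every link on a page covers much of this need without a full Selenium suite. A sketch assuming requests and BeautifulSoup:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def broken_links(page_url):
    """Return (link, status) pairs for every link on one page that fails to load."""
    html = requests.get(page_url, timeout=10).text
    failures = []
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(page_url, a["href"])
        try:
            # Some servers reject HEAD; a fallback GET would cover those.
            status = requests.head(link, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = None
        if status is None or status >= 400:
            failures.append((link, status))
    return failures
```
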
44 votes, 5 answers

How to find sitemap.xml path on websites?

How can I find sitemap.xml file of websites? e.g. Going to stackoverflow/sitemap.xml gets me a 404. In stackoverflow/robots.txt is written the following: "this technically isn't valid, since for some godforsaken reason sitemap paths must be…
jacktrades • 7,224 • 13 • 56 • 83
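
There is no guaranteed path; the convention is to read any Sitemap: lines that robots.txt declares and fall back to /sitemap.xml. A sketch:

```python
from urllib.parse import urljoin

import requests

def find_sitemaps(base_url):
    """Return sitemap URLs declared in robots.txt, else /sitemap.xml if it exists."""
    robots = requests.get(urljoin(base_url, "/robots.txt"), timeout=10)
    declared = [
        line.split(":", 1)[1].strip()
        for line in robots.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    if declared:
        return declared
    default = urljoin(base_url, "/sitemap.xml")
    if requests.head(default, allow_redirects=True, timeout=10).status_code == 200:
        return [default]
    return []
```
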
43 votes, 2 answers

Scrapy Python Set up User Agent

I tried to override the user-agent of my crawlspider by adding an extra line to the project configuration file. Here is the code: [settings] default = myproject.settings USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML,…
B.Mr.W. • 18,910 • 35 • 114 • 178
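
USER_AGENT is normally set in the project's settings.py rather than scrapy.cfg, which only points to the settings module. A sketch of both the project-wide and per-spider forms:

```python
# settings.py (applies to the whole project)
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36"

# or, to override it for a single spider:
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36",
    }
```

custom_settings takes precedence over settings.py for that one spider.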
43 votes, 5 answers

how to filter duplicate requests based on url in scrapy

I am writing a crawler for a website using scrapy with CrawlSpider. Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on urls. Also, I can filter requests using rules member of CrawlSpider. What I want to…
nik-v • 753 • 1 • 9 • 20
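
Scrapy's built-in filter deduplicates on a request fingerprint; to change what counts as a duplicate, the usual route is a custom dupefilter registered through the DUPEFILTER_CLASS setting. A sketch (module path as in recent Scrapy releases) that treats URLs as equal regardless of query string, an illustrative policy rather than the question's exact rules:

```python
# dupefilters.py
from urllib.parse import urlparse, urlunparse

from scrapy.dupefilters import RFPDupeFilter

class QueryStrippingDupeFilter(RFPDupeFilter):
    """Consider two requests duplicates if they share scheme://host/path."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set()

    def request_seen(self, request):
        stripped = urlunparse(urlparse(request.url)._replace(query="", fragment=""))
        if stripped in self.seen_urls:
            return True
        self.seen_urls.add(stripped)
        return False

# settings.py
# DUPEFILTER_CLASS = "myproject.dupefilters.QueryStrippingDupeFilter"
```
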
42 votes, 6 answers

how to extract links and titles from a .html page?

for my website, i'd like to add a new functionality. I would like user to be able to upload his bookmarks backup file (from any browser if possible) so I can upload it to their profile and they don't have to insert all of them manually... the only…
Toni Michel Caubet • 19,333 • 56 • 202 • 378
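
Browser bookmark exports are plain HTML, so any HTML parser can pull the anchors out. The question concerns a PHP site, so treat this Python/BeautifulSoup sketch as the idea rather than a drop-in answer:

```python
from bs4 import BeautifulSoup

def extract_bookmarks(html):
    """Return (title, url) pairs for every anchor in an uploaded bookmarks file."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]

with open("bookmarks.html", encoding="utf-8") as f:  # hypothetical uploaded file
    for title, url in extract_bookmarks(f.read()):
        print(title, url)
```
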
40 votes, 2 answers

how to totally ignore 'debugger' statement in chrome?

'never pause here' can not work after I continue: still paused
chen • 403 • 1 • 4 • 6
38 votes, 6 answers

How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?

I am running many instances of a webcrawler in parallel. Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain. Other parallel crawlers check the log table to see what…
T. Brian Jones • 13,002 • 25 • 78 • 117
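
The usual way to let parallel crawlers claim work without stepping on each other is a transaction that locks the selected row, rather than locking whole tables. A sketch assuming a pymysql-style connection and a hypothetical domains table:

```python
def claim_domain(conn):
    """Atomically pick an unclaimed domain and mark it as being crawled."""
    with conn.cursor() as cur:
        cur.execute("START TRANSACTION")
        # Row lock: other crawlers block on this row until we commit.
        cur.execute(
            "SELECT id, domain FROM domains "
            "WHERE claimed_at IS NULL "
            "ORDER BY id LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None
        cur.execute("UPDATE domains SET claimed_at = NOW() WHERE id = %s", (row[0],))
        conn.commit()
        return row[1]
```

On MySQL 8+, appending SKIP LOCKED to the SELECT lets other crawlers move straight to the next free row instead of waiting on the lock.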
38 votes, 8 answers

guide on crawling the entire web?

i just had this thought, and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (like Core2Duo, 8gig ram, 750gb disk 100mbps) . I've come across a paper where this was done....but i cannot…
bohohasdhfasdf • 693 • 2 • 11 • 16
36 votes, 6 answers

How to identify web-crawler?

How can I filter out hits from webcrawlers etc. Hits which not is human.. I use maxmind.com to request the city from the IP.. It is not quite cheap if I have to pay for ALL hits including webcrawlers, robots etc.
clarkk • 27,151 • 72 • 200 • 340
36 votes, 4 answers

Passing arguments to process.crawl in Scrapy python

I would like to get the same result as this command line : scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json My script is as follows : import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from…
yusuf • 3,591 • 8 • 45 • 86
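
CrawlerProcess.crawl forwards keyword arguments to the spider's constructor, so the -a pairs from the command line map directly onto kwargs. A sketch, with the spider class name taken from the excerpt:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from linkedin_anonymous_spider import LinkedInAnonymousSpider

process = CrawlerProcess(get_project_settings())
# Equivalent of: scrapy crawl linkedin_anonymous -a first=James -a last=Bond
process.crawl(LinkedInAnonymousSpider, first="James", last="Bond")
process.start()

# Inside the spider, the values arrive as constructor kwargs:
# class LinkedInAnonymousSpider(scrapy.Spider):
#     def __init__(self, first=None, last=None, *args, **kwargs):
#         super().__init__(*args, **kwargs)
#         self.first, self.last = first, last
```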