Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
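
As an illustration of the maintenance use case, a link checker can be quite small. The following Python sketch is illustrative only: the page URL is a placeholder, and a real checker would fall back to GET where a server rejects HEAD requests.

    # Minimal link-checker sketch: fetch one page, extract its anchors,
    # and report the HTTP status of each linked URL.
    # Python 3 standard library only; the start page is a placeholder.
    from html.parser import HTMLParser
    from urllib.error import HTTPError, URLError
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    class LinkExtractor(HTMLParser):
        """Collect href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def check_links(page_url):
        # Fetch the page and parse out its anchors.
        html = urlopen(page_url, timeout=10).read().decode("utf-8", errors="replace")
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            url = urljoin(page_url, href)  # resolve relative links against the page
            try:
                # HEAD keeps traffic low; some servers reject it, which a
                # real checker would handle by falling back to GET.
                code = urlopen(Request(url, method="HEAD"), timeout=10).getcode()
            except HTTPError as e:
                code = e.code              # 404, 500, ... -> broken or failing link
            except URLError as e:
                code = f"unreachable: {e.reason}"
            print(code, url)

    if __name__ == "__main__":
        check_links("https://example.com/")  # placeholder page to check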

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
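
That seed-and-frontier loop can be sketched in a few lines of Python. The policies in the sketch (http/https links only, a page cap, a placeholder seed) are illustrative assumptions rather than part of any standard design:

    # Sketch of the seed/frontier loop described above.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)   # the crawl frontier; FIFO gives breadth-first order
        seen = set(seeds)         # guards against revisits and infinite loops
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except OSError:
                continue          # unreachable page: skip it and keep crawling
            fetched += 1
            yield url
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)
                # Example policy: only http(s) links that were not queued before.
                if urlparse(link).scheme in ("http", "https") and link not in seen:
                    seen.add(link)
                    frontier.append(link)

    if __name__ == "__main__":
        for page in crawl(["https://example.com/"]):  # placeholder seed list
            print(page)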

The large volume of the Web implies that a crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.
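
One way to express that prioritization is a priority-queue frontier. The scoring function below (shallower URL paths first) is a made-up stand-in; real crawlers use signals such as link popularity or observed change frequency:

    # Sketch of a prioritized frontier backed by a heap.
    import heapq
    from itertools import count
    from urllib.parse import urlparse

    class PriorityFrontier:
        """Frontier that yields 'more important' URLs first."""
        def __init__(self):
            self._heap = []
            self._order = count()  # tie-breaker: equal scores pop in insertion order

        def push(self, url):
            # Stand-in importance score: prefer URLs with shallower paths.
            score = urlparse(url).path.rstrip("/").count("/")
            heapq.heappush(self._heap, (score, next(self._order), url))

        def pop(self):
            return heapq.heappop(self._heap)[-1]

        def __len__(self):
            return len(self._heap)

    frontier = PriorityFrontier()
    for u in ("https://example.com/a/b/c", "https://example.com/", "https://example.com/a"):
        frontier.push(u)
    while frontier:
        print(frontier.pop())  # prints shallow URLs before deep ones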

The number of possible crawlable URLs generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This combinatorial explosion creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
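
Crawlers typically mitigate this by canonicalizing URLs before adding them to the frontier, so that parameter variants collapse to a single entry. A minimal Python sketch, assuming a known (hypothetical) list of content-insignificant parameter names:

    # Sketch of URL canonicalization: sort query parameters and drop ones
    # that do not change the content, so variant URLs collapse to one entry.
    # Which parameters are "insignificant" is site-specific; this set is assumed.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    INSIGNIFICANT = {"sort", "thumb_size", "format", "show_user_content"}  # assumed names

    def canonicalize(url):
        parts = urlsplit(url)
        params = [(k, v) for k, v in parse_qsl(parts.query) if k not in INSIGNIFICANT]
        params.sort()  # stable order: ?a=1&b=2 and ?b=2&a=1 become identical
        return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                           urlencode(params), ""))  # fragment dropped too

    # Both gallery variants collapse to the same canonical URL:
    print(canonicalize("https://example.com/gallery?sort=date&thumb_size=small&format=jpg"))
    print(canonicalize("https://example.com/gallery?format=png&sort=name"))
    # -> both print https://example.com/gallery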

9683 questions
82 votes · 7 answers

crawler vs scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?
Nayn
78 votes · 2 answers

Search in HTML source with Google?

I have several websites, and I can't remember where I wrote some lines of code. As my pages are indexed by Google, I would like to know if Google offers a facility to search within the HTML source code/mark-up itself, instead of just allowing search…
Entretoize
75 votes · 5 answers

PyPI download counts seem unrealistic

I put a package on PyPI for the first time ~2 months ago, and have made some version updates since then. This week I noticed the download count recording, and was surprised to see it had been downloaded hundreds of times. Over the next few days, I…
jeffalstott
75 votes · 8 answers

Designing a web crawler

I have come across an interview question, "If you were designing a web crawler, how would you avoid getting into infinite loops?", and I am trying to answer it. How does it all begin? Say Google started with some hub pages, say…
74 votes · 6 answers

Python: [Errno 10054] An existing connection was forcibly closed by the remote host

I am writing Python to crawl Twitter space using Twitter-py. I have set the crawler to sleep for a while (2 seconds) between each request to api.twitter.com. However, after some time running (around 1), when Twitter's rate limit is not exceeded…
Nama Keru
74 votes · 3 answers

getting Forbidden by robots.txt: scrapy

While crawling a website like https://www.netflix.com, I am getting Forbidden by robots.txt: <GET https://www.netflix.com/> ERROR: No response downloaded for: https://www.netflix.com/
deepak kumar
71 votes · 3 answers

Spider a Website and Return URLs Only

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a…
Rob Wilkerson
69 votes · 4 answers

Click a Button in Scrapy

I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (of course, it also appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins) as shown here. But…
naeg
68 votes · 5 answers

How to do HTTP-request/call with JSON payload from command-line?

What's the easiest way to do a JSON call from the command-line? I have a website that does a JSON call to retrieve additional data. The Request Payload as shown in Google Chrome looks like this: {"version": "1.1",…
Roalt
67 votes · 8 answers

Anyone know of a good Python based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the…
Matt
66 votes · 6 answers

Python: maximum recursion depth exceeded while calling a Python object

I've built a crawler that has to run on about 5M pages (by incrementing the URL ID) and then parse the pages which contain the info I need. After running an algorithm on the URLs (200K) and saving the good and bad results, I found that I'm…
YSY
64 votes · 6 answers

Change IP address dynamically?

Consider the case: I want to crawl websites frequently, but my IP address gets blocked after some days/limit. So, how can I change my IP address dynamically? Any other ideas?
Magendran V
64 votes · 15 answers

How do I make a simple crawler in PHP?

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file. Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
KJ Saxena
64 votes · 10 answers

How to write a crawler?

I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does…
Jason
63 votes · 9 answers

Detect Search Crawlers via JavaScript

I am wondering how I would go about detecting search crawlers. The reason I ask is because I want to suppress certain JavaScript calls if the user agent is a bot. I have found an example of how to detect a certain browser, but am unable to…
Jon