Questions tagged [web-crawler]

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner.

Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
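As a minimal sketch of this seed-and-frontier loop (in Python, with arbitrary assumptions: a breadth-first queue, a page limit, and no politeness rules such as robots.txt handling):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href attribute of every anchor tag."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)   # URLs still to visit: the crawl frontier
        visited = set()           # URLs already fetched
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue          # unreachable or broken page: skip it
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                frontier.append(urljoin(url, link))  # resolve relative links
        return visited

    # e.g. crawl(["https://example.com/"])  -- the seed URL is hypothetical

A real crawler would replace the plain deque with a priority queue implementing the selection and politeness policies mentioned above.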

The Web's large volume implies that a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. Its high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
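One common mitigation is to canonicalize URLs before they enter the frontier, so that parameter order and content-irrelevant parameters do not multiply the URL space. A sketch (the parameter names are hypothetical, standing in for the gallery options above):

    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    # Parameters assumed not to change the underlying content.
    IGNORED_PARAMS = {"thumb_size", "file_format", "hide_user_content"}

    def canonicalize(url):
        """Drop content-irrelevant parameters and sort the rest, so
        equivalent URLs collapse to a single frontier entry."""
        scheme, netloc, path, query, _fragment = urlsplit(url)
        params = sorted((k, v) for k, v in parse_qsl(query)
                        if k not in IGNORED_PARAMS)
        return urlunsplit((scheme, netloc.lower(), path, urlencode(params), ""))

    # Both of these map to http://gallery.example/?sort=date :
    # canonicalize("http://gallery.example/?sort=date&thumb_size=small")
    # canonicalize("http://gallery.example/?thumb_size=large&sort=date")

With the three content-irrelevant options dropped, the 48 gallery URLs collapse to the four sort orders that actually return distinct content.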


9683 questions
16
votes
1 answer

Scrapy vs. Nutch

I am planning to use web crawling in an application I am currently working on. I did some research on Nutch and ran some preliminary tests using it, but then I came across Scrapy. When I did some preliminary research and went through the…
Vidhu
  • 193
  • 1
  • 2
  • 9
16
votes
12 answers

Is it possible to write a web crawler in JavaScript?

I want to crawl a page, check for the hyperlinks in that page, follow those hyperlinks, and capture data from the pages they lead to.
Ashwin Mendon
  • 231
  • 1
  • 3
  • 12
15
votes
4 answers

HtmlAgilityPack & Selenium WebDriver return random results

I'm trying to scrape product names from a website. Oddly, I seem to scrape only a random 12 items. I've tried both HtmlAgilityPack and HttpClient, and I get the same random results. Here's my code for HtmlAgilityPack: using…
15
votes
5 answers

Get Scrapy crawler output/results in script file function

I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider's output/results in some function in that script file. I did not want to save the output/results to any file or…
Ahsan aslam
  • 1,149
  • 2
  • 16
  • 35
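A well-known pattern for this question (a sketch, not the asker's code; MySpider and its import path are placeholders) is to connect a handler to Scrapy's item_scraped signal, so items accumulate in a plain list the rest of the script can use:

    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    from myproject.spiders.news import MySpider  # hypothetical spider

    items = []

    def collect_item(item, response, spider):
        items.append(item)  # called once per scraped item

    process = CrawlerProcess()
    crawler = process.create_crawler(MySpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    # After start() returns, `items` holds the spider's output and can
    # be passed to any function in the script -- no output file needed.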
15
votes
5 answers

Scrapy - how to identify already scraped urls

I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already-scraped URLs? Also, is there any clear documentation or examples on SgmlLinkExtractor?
Avinash
  • 583
  • 2
  • 6
  • 19
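Scrapy's built-in duplicate filter only deduplicates requests within a single run, so daily re-crawls need their own persistence. A minimal sketch (the file name and helper names are assumptions, not from the question):

    import os

    SEEN_FILE = "seen_urls.txt"  # survives between daily runs

    def load_seen():
        if not os.path.exists(SEEN_FILE):
            return set()
        with open(SEEN_FILE) as f:
            return {line.strip() for line in f}

    def mark_seen(url):
        with open(SEEN_FILE, "a") as f:
            f.write(url + "\n")

    # Inside a spider callback, roughly:
    # if response.url not in self.seen:     # self.seen = load_seen() in __init__
    #     mark_seen(response.url)
    #     yield self.build_item(response)   # hypothetical item-building method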
15
votes
4 answers

Crawling and Scraping iTunes App Store

I noticed that iTunes Preview allows you to crawl and scrape pages via the http:// protocol. However, many of the links try to open in iTunes rather than the browser. For example, when you go to the iBooks page, it immediately tries…
Senseful
  • 86,719
  • 67
  • 308
  • 465
15
votes
2 answers

Parse HTML content in VBA

I have a question relating to HTML parsing. I have a website with some products, and I would like to capture text from the page into my current spreadsheet. This spreadsheet is quite big but contains the ItemNbr in the 3rd column; I expect the text in the 14th…
Tdev
  • 183
  • 1
  • 2
  • 11
15
votes
1 answer

AngularJS SEO using HTML5 mode: Would love some clarity on how this functions behind-the-scenes

There are numerous resources out there for implementing SEO-friendly versions of AngularJS applications, of course. Despite reading all of them numerous times, I'm still a bit unclear on a few things, particularly regarding the distinction between…
J. Ky Marsh
  • 2,465
  • 3
  • 26
  • 32
15
votes
1 answer

Chrome Devtools: Save specific requests in Network Tab

Can I save just specific requests in the Chrome Devtools Network tab? It would be very useful to me since our company uses web crawling to fetch info from extranets, and the most I can do is to record (with the rec button) all the requests made to…
luis.ap.uyen
  • 1,314
  • 1
  • 11
  • 29
15
votes
8 answers

Do Google's crawlers interpret Javascript? What if I load a page through AJAX?

When a user enters my page, I have to make another AJAX call...to load data inside a div. That's just how my application works. The problem is...when I view the source of this code, it does not contain the source of that AJAX. Of course, when I do…
TIMEX
  • 259,804
  • 351
  • 777
  • 1,080
15
votes
6 answers

How to crawl billions of pages?

Is it possible to crawl billions of pages on a single server?
gpow
  • 711
  • 3
  • 8
  • 18
15
votes
1 answer

Obtaining static HTML files from Wikipedia XML dump

I would like to be able to obtain relatively up-to-date static HTML files from the enormous (even when compressed) English Wikipedia XML dump file enwiki-latest-pages-articles.xml.bz2 I downloaded from the WikiMedia dump page. There seem to be quite…
Brian Schmitz
  • 1,023
  • 1
  • 10
  • 19
14
votes
1 answer

How do I stop all spiders and the engine immediately after a condition in a pipeline is met?

We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines for all items passed by all crawlers. One of the pipeline components queries the Google servers for geocoding addresses. Google…
aniketd
  • 385
  • 1
  • 3
  • 15
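One approach (a sketch under the assumption that closing the offending spider is acceptable; the quota check is a placeholder) is to grab the crawler in from_crawler and ask its engine to close the spider when the condition fires:

    from scrapy.exceptions import DropItem

    class GeocodingGuardPipeline:
        """Shuts a crawl down when a condition is met, e.g. a
        geocoding quota being exhausted (hypothetical check)."""

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            pipeline.crawler = crawler  # handle on the running crawler
            return pipeline

        def process_item(self, item, spider):
            if self.quota_exceeded(item):
                # Discards queued requests and fires spider_closed handlers.
                self.crawler.engine.close_spider(spider, "quota_exceeded")
                raise DropItem("shutting down: geocoding quota exceeded")
            return item

        def quota_exceeded(self, item):
            return False  # placeholder for the real condition

Note that close_spider only stops the spider it is given; with several spiders in one process, each spider's crawler would need the same treatment.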
14
votes
3 answers

How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (it could be travel, technology, or whatever). How would I do that? Have a spider/crawler that will crawl the web to find the information I need (how would I…
Mircea
14
votes
2 answers

Where to store web crawler data?

I have a simple web crawler that starts at a root (a given URL), downloads the HTML of the root page, then scans for hyperlinks and crawls them. I currently store the HTML pages in an SQL database. I am currently facing two problems: It seems like the…
Mike G
  • 4,829
  • 11
  • 47
  • 76
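A common split for this problem (sketched below; the paths and schema are illustrative assumptions, not the asker's setup) is to keep raw HTML on the filesystem, named by a hash of the URL, and keep only metadata in the database:

    import hashlib
    import pathlib
    import sqlite3

    STORE = pathlib.Path("pages")
    STORE.mkdir(exist_ok=True)

    db = sqlite3.connect("crawl.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pages (
        url TEXT PRIMARY KEY,
        path TEXT NOT NULL,
        fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
    )""")

    def store_page(url, html):
        """Write the HTML to disk and record where it went."""
        name = hashlib.sha256(url.encode()).hexdigest() + ".html"
        path = STORE / name
        path.write_text(html, encoding="utf-8")
        db.execute("INSERT OR REPLACE INTO pages (url, path) VALUES (?, ?)",
                   (url, str(path)))
        db.commit()

This keeps the database small and fast to query, while the large blobs live where filesystems handle them best.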