Questions tagged [web-crawler]

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner.

Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
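As a minimal sketch of this seed-and-frontier loop (in Python, with arbitrary assumptions: a breadth-first queue, a page limit, and no politeness rules such as robots.txt handling):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href attribute of every anchor tag."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)   # URLs still to visit: the crawl frontier
        visited = set()           # URLs already fetched
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue          # unreachable or broken page: skip it
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                frontier.append(urljoin(url, link))  # resolve relative links
        return visited

    # e.g. crawl(["https://example.com/"])  -- the seed URL is hypothetical

A real crawler would replace the plain deque with a priority queue implementing the selection and politeness policies mentioned above.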

The Web's large volume implies that a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. Its high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
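One common mitigation is to canonicalize URLs before they enter the frontier, so that parameter order and content-irrelevant parameters do not multiply the URL space. A sketch (the parameter names are hypothetical, standing in for the gallery options above):

    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    # Parameters assumed not to change the underlying content.
    IGNORED_PARAMS = {"thumb_size", "file_format", "hide_user_content"}

    def canonicalize(url):
        """Drop content-irrelevant parameters and sort the rest, so
        equivalent URLs collapse to a single frontier entry."""
        scheme, netloc, path, query, _fragment = urlsplit(url)
        params = sorted((k, v) for k, v in parse_qsl(query)
                        if k not in IGNORED_PARAMS)
        return urlunsplit((scheme, netloc.lower(), path, urlencode(params), ""))

    # Both of these map to http://gallery.example/?sort=date :
    # canonicalize("http://gallery.example/?sort=date&thumb_size=small")
    # canonicalize("http://gallery.example/?thumb_size=large&sort=date")

With the three content-irrelevant options dropped, the 48 gallery URLs collapse to the four sort orders that actually return distinct content.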


9683 questions
16
votes
1 answer

Scrapy vs. Nutch

I am planning to use web crawling in an application I am currently working on. I did some research on Nutch and ran some preliminary tests using it, but then I came across Scrapy. When I did some preliminary research and went through the…
Vidhu
  • 193
  • 1
  • 2
  • 9
16
votes
12 answers

Is it possible to write a web crawler in JavaScript?

I want to crawl a page, check for the hyperlinks in that page, follow those hyperlinks, and capture data from the pages they lead to.
Ashwin Mendon
  • 231
  • 1
  • 3
  • 12
15
votes
4 answers

HtmlAgilityPack & Selenium WebDriver return random results

I'm trying to scrape product names from a website. Oddly, I seem to scrape only a random 12 items. I've tried both HtmlAgilityPack and HttpClient, and I get the same random results. Here's my code for HtmlAgilityPack: using…
15
votes
5 answers

Get Scrapy crawler output/results in script file function

I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider's output/results in some function in that script file. I did not want to save the output/results to any file or…
Ahsan aslam
  • 1,149
  • 2
  • 16
  • 35
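A well-known pattern for this question (a sketch, not the asker's code; MySpider and its import path are placeholders) is to connect a handler to Scrapy's item_scraped signal, so items accumulate in a plain list the rest of the script can use:

    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    from myproject.spiders.news import MySpider  # hypothetical spider

    items = []

    def collect_item(item, response, spider):
        items.append(item)  # called once per scraped item

    process = CrawlerProcess()
    crawler = process.create_crawler(MySpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    # After start() returns, `items` holds the spider's output and can
    # be passed to any function in the script -- no output file needed.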
15
votes
5 answers

Scrapy - how to identify already scraped urls

I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already-scraped URLs? Also, is there any clear documentation or examples on SgmlLinkExtractor?
Avinash
  • 583
  • 2
  • 6
  • 19
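Scrapy's built-in duplicate filter only deduplicates requests within a single run, so daily re-crawls need their own persistence. A minimal sketch (the file name and helper names are assumptions, not from the question):

    import os

    SEEN_FILE = "seen_urls.txt"  # survives between daily runs

    def load_seen():
        if not os.path.exists(SEEN_FILE):
            return set()
        with open(SEEN_FILE) as f:
            return {line.strip() for line in f}

    def mark_seen(url):
        with open(SEEN_FILE, "a") as f:
            f.write(url + "\n")

    # Inside a spider callback, roughly:
    # if response.url not in self.seen:     # self.seen = load_seen() in __init__
    #     mark_seen(response.url)
    #     yield self.build_item(response)   # hypothetical item-building method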
15
votes
4 answers

Crawling and Scraping iTunes App Store

I noticed that iTunes Preview allows you to crawl and scrape pages via the http:// protocol. However, many of the links try to open in iTunes rather than the browser. For example, when you go to the iBooks page, it immediately tries…
Senseful
  • 86,719
  • 67
  • 308
  • 465
15
votes
2 answers

Parse HTML content in VBA

I have a question relating to HTML parsing. I have a website with some products, and I would like to capture text from the page into my current spreadsheet. This spreadsheet is quite big but contains the ItemNbr in the 3rd column; I expect the text in the 14th…
Tdev
  • 183
  • 1
  • 2
  • 11
15
votes
1 answer

AngularJS SEO using HTML5 mode: Would love some clarity on how this functions behind-the-scenes

There are numerous resources out there for implementing SEO-friendly versions of AngularJS applications, of course. Despite reading all of them numerous times, I'm still a bit unclear on a few things, particularly regarding the distinction between…
J. Ky Marsh
  • 2,465
  • 3
  • 26
  • 32
15
votes
1 answer

Chrome Devtools: Save specific requests in Network Tab

Can I save just specific requests in the Chrome Devtools Network tab? It would be very useful to me since our company uses web crawling to fetch info from extranets, and the most I can do is to record (with the rec button) all the requests made to…
luis.ap.uyen
  • 1,314
  • 1
  • 11
  • 29
15
votes
8 answers

Do Google's crawlers interpret Javascript? What if I load a page through AJAX?

When a user enters my page, I have to make another AJAX call...to load data inside a div. That's just how my application works. The problem is...when I view the source of this code, it does not contain the source of that AJAX. Of course, when I do…
TIMEX
  • 259,804
  • 351
  • 777
  • 1,080
15
votes
6 answers

How to crawl billions of pages?

Is it possible to crawl billions of pages on a single server?
gpow
  • 711
  • 3
  • 8
  • 18
15
votes
1 answer

Obtaining static HTML files from Wikipedia XML dump

I would like to be able to obtain relatively up-to-date static HTML files from the enormous (even when compressed) English Wikipedia XML dump file enwiki-latest-pages-articles.xml.bz2 I downloaded from the WikiMedia dump page. There seem to be quite…
Brian Schmitz
  • 1,023
  • 1
  • 10
  • 19
14
votes
1 answer

How do I stop all spiders and the engine immediately after a condition in a pipeline is met?

We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines for all items passed by all crawlers. One of the pipeline components queries the Google servers for geocoding addresses. Google…
aniketd
  • 385
  • 1
  • 3
  • 15
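One approach (a sketch under the assumption that closing the offending spider is acceptable; the quota check is a placeholder) is to grab the crawler in from_crawler and ask its engine to close the spider when the condition fires:

    from scrapy.exceptions import DropItem

    class GeocodingGuardPipeline:
        """Shuts a crawl down when a condition is met, e.g. a
        geocoding quota being exhausted (hypothetical check)."""

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            pipeline.crawler = crawler  # handle on the running crawler
            return pipeline

        def process_item(self, item, spider):
            if self.quota_exceeded(item):
                # Discards queued requests and fires spider_closed handlers.
                self.crawler.engine.close_spider(spider, "quota_exceeded")
                raise DropItem("shutting down: geocoding quota exceeded")
            return item

        def quota_exceeded(self, item):
            return False  # placeholder for the real condition

Note that close_spider only stops the spider it is given; with several spiders in one process, each spider's crawler would need the same treatment.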
14
votes
3 answers

How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (it could be travel, technology, or whatever). How would I do that? Have a spider/crawler that will crawl the web to find the information I need (how would I…
Mircea
14
votes
2 answers

Where to store web crawler data?

I have a simple web crawler that starts at a root (a given URL), downloads the HTML of the root page, then scans for hyperlinks and crawls them. I currently store the HTML pages in an SQL database. I am currently facing two problems: It seems like the…
Mike G
  • 4,829
  • 11
  • 47
  • 76
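A common split for this problem (sketched below; the paths and schema are illustrative assumptions, not the asker's setup) is to keep raw HTML on the filesystem, named by a hash of the URL, and keep only metadata in the database:

    import hashlib
    import pathlib
    import sqlite3

    STORE = pathlib.Path("pages")
    STORE.mkdir(exist_ok=True)

    db = sqlite3.connect("crawl.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pages (
        url TEXT PRIMARY KEY,
        path TEXT NOT NULL,
        fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
    )""")

    def store_page(url, html):
        """Write the HTML to disk and record where it went."""
        name = hashlib.sha256(url.encode()).hexdigest() + ".html"
        path = STORE / name
        path.write_text(html, encoding="utf-8")
        db.execute("INSERT OR REPLACE INTO pages (url, path) VALUES (?, ?)",
                   (url, str(path)))
        db.commit()

This keeps the database small and fast to query, while the large blobs live where filesystems handle them best.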