Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
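
As a rough illustration of that loop, here is a minimal Python sketch (requests and BeautifulSoup are assumed to be available; the crawl policies, robots.txt handling, and rate limiting that a real crawler needs are omitted):

```python
# Minimal seed/frontier loop: visit seed URLs, harvest hyperlinks,
# and push them onto the frontier until a page budget is exhausted.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # crawl frontier: URLs still to visit
    visited = set()           # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue          # skip unreachable pages
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            # Resolve relative links and add them to the frontier.
            frontier.append(urljoin(url, link["href"]))
    return visited
```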

The large volume of the Web means that a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. The Web's high rate of change means that by the time a crawler revisits a page, it may already have been updated or even deleted.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
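
To make the arithmetic concrete, the sketch below enumerates the 4 × 3 × 2 × 2 = 48 URL variants for such a gallery and shows one common mitigation: stripping presentation-only query parameters before de-duplication. The parameter names are invented for the example.

```python
from itertools import product
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

SORTS = ["date", "name", "size", "rating"]   # four ways to sort images
THUMBS = ["small", "medium", "large"]        # three thumbnail sizes
FORMATS = ["jpg", "png"]                     # two file formats
USER_CONTENT = ["on", "off"]                 # toggle user-provided content

variants = [
    "/gallery?" + urlencode({"sort": s, "thumb": t, "fmt": f, "user": u})
    for s, t, f, u in product(SORTS, THUMBS, FORMATS, USER_CONTENT)
]
print(len(variants))  # 48 distinct URLs serving the same content

# One mitigation: drop presentation-only parameters before de-duplicating URLs.
PRESENTATION_PARAMS = {"sort", "thumb", "fmt", "user"}

def canonicalize(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in PRESENTATION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))
```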

9683 questions
2 votes, 2 answers

Web Scraping with JavaScript?

I'm having a hard time figuring out how to scrape this webpage to get this wedding list into my one-pager. It doesn't seem complicated at first, but when I get into the code, I just can't get any results. I've tried ygrab.js, which was fairly simple…
Elydee
2 votes, 1 answer

How to make a Python 3 web scraping program deal with local cookies?

I tried to write a program that can automatically download files (via PHP links). However, I have two issues right now. First, my target website requires registration for first-time access. Then, every time I click the download link, it…
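
A hedged sketch of one common answer to this kind of problem: requests.Session keeps the cookies received at registration or login, so later download links are fetched within the same session. The login URL and form fields below are placeholders, not the asker's site.

```python
import requests

with requests.Session() as session:
    # Log in once; the session stores whatever cookies the site sets.
    session.post("https://example.com/login.php",
                 data={"username": "me", "password": "secret"})

    # Subsequent requests reuse those cookies automatically.
    response = session.get("https://example.com/download.php?id=1")
    with open("file_1.bin", "wb") as fh:
        fh.write(response.content)
```
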
2 votes, 1 answer

Running Scrapy multiple times on the same URL

I'd like to crawl a certain URL which returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. This code only runs once with Scrapy…
Computa
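
A hedged sketch of one way to do this: Scrapy filters duplicate requests by default, so re-requesting the same address needs dont_filter=True. The URL and the yielded fields are placeholders.

```python
import scrapy

class RandomResponseSpider(scrapy.Spider):
    name = "random_response"
    start_urls = ["https://example.com/random"]
    samples_left = 100  # how many responses to collect

    def parse(self, response):
        yield {"text": response.text}
        self.samples_left -= 1
        if self.samples_left > 0:
            # dont_filter bypasses the duplicate-request filter,
            # so the same URL is fetched again.
            yield scrapy.Request(response.url, callback=self.parse,
                                 dont_filter=True)
```
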
2 votes, 2 answers

Replacement for the has_key method in Python 3

I want to change this: def has_class_but_no_id(tag): return tag.has_key('class') and not tag.has_key('id'). This function is for Python 2, not Python 3. I had the idea of turning this HTML document into a list like this: list_of_descendants =…
Tae
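
A minimal sketch of the Python 3 equivalent: BeautifulSoup tags no longer offer has_key(), but has_attr() (or "name in tag.attrs") does the same job.

```python
from bs4 import BeautifulSoup

def has_class_but_no_id(tag):
    # has_attr() replaces the removed has_key() method.
    return tag.has_attr("class") and not tag.has_attr("id")

soup = BeautifulSoup('<p class="a">x</p><p class="b" id="y">z</p>',
                     "html.parser")
print(soup.find_all(has_class_but_no_id))  # only the first <p> matches
```
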
2 votes, 1 answer

StormCrawler's archetype topology does not fetch outlinks

From my understanding, the basic example should be able to crawl and fetch pages. I followed the example at http://stormcrawler.net/getting-started/, but the crawler seems to fetch only a few pages and then does nothing more. I wanted to crawl…
2 votes, 3 answers

Count the number of pages in a site

I'd like to know how many public pages there are on a site, say, for example, smashingmagazine.com. Is there a way to count the number of pages?
Gaurav Sharma
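
A hedged sketch of one rough approach: if the site publishes an XML sitemap, counting its url entries gives an estimate of the number of public pages. The sitemap location is a guess, many sites do not expose one, and large sites often use a sitemap index that nests further sitemap files.

```python
import requests
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def count_sitemap_urls(sitemap_url):
    # Fetch the sitemap and count its <url> entries.
    xml = requests.get(sitemap_url, timeout=10).content
    root = ElementTree.fromstring(xml)
    return len(root.findall(SITEMAP_NS + "url"))

print(count_sitemap_urls("https://www.smashingmagazine.com/sitemap.xml"))
```
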
2 votes, 1 answer

Python - crawl data from an image in HTML (whose source code is actually a paragraph)

I'm trying to crawl data from the following image on a website, and the source code of the corresponding image is shown below. I want to use Python to extract the data from the image and make it readable. However, as the structure of the source…
Chianti5
2 votes, 1 answer

X509 Certificate Exception while crawling some urls with StormCrawler

I have been using StormCrawler to crawl websites. For the https protocol, I set the default https protocol in StormCrawler. However, when I crawl some websites I receive the exception below: Caused by:…
isspek
2 votes, 2 answers

How can I know the geographic origin of a web page or URL?

I'm building a web crawler and I'm trying to figure out where a web page is from. I mean, I can check the domain (for example, .com.ar is from Argentina), but there are other sites with other domains (.com, .net) that are Argentinean too, an…
santiagobasulto
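
A hedged sketch of the two usual signals: the country-code TLD, and the server's IP address geolocated against a GeoIP database (geoip2 and a local GeoLite2-Country.mmdb file are assumed here). Neither is definitive, since hosting location is not the same as the audience's country.

```python
import socket
from urllib.parse import urlparse

import geoip2.database

def guess_country(url, mmdb_path="GeoLite2-Country.mmdb"):
    host = urlparse(url).hostname
    tld = host.rsplit(".", 1)[-1]
    if len(tld) == 2:                    # country-code TLD such as .ar
        return tld.upper()
    ip = socket.gethostbyname(host)      # otherwise fall back to hosting location
    with geoip2.database.Reader(mmdb_path) as reader:
        return reader.country(ip).country.iso_code

print(guess_country("https://www.example.com.ar/"))
```
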
2 votes, 1 answer

Callback function isn't fired in Scrapy crawler

I need to use my function parsePage as the callback for requesting links I crawled from the website. However, the request is sent only once, to the first link, and I get no response. Here is my code: class diploma(CrawlSpider): name =…
Konstantin
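
A hedged sketch of a working setup: with CrawlSpider the follow-up requests come from Rule objects, and the callback must not be named parse (CrawlSpider uses that method internally). The domain and the extracted fields are placeholders.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DiplomaSpider(CrawlSpider):
    name = "diploma"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    rules = (
        # Follow every link and hand each response to parsePage.
        Rule(LinkExtractor(), callback="parsePage", follow=True),
    )

    def parsePage(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}
```
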
2 votes, 2 answers

Scrapy - Scrape sitemap with LinkExtractor

How would you scrape a sitemap URL with a LinkExtractor? The sitemap lists entries such as http://www.example.com/ with a last-modified date of 2005-01-01…
Maxime De Bruyn
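
A hedged sketch of the usual alternative: Scrapy ships a SitemapSpider that parses sitemap XML directly, which is often simpler than a LinkExtractor here. The sitemap URL is a placeholder.

```python
from scrapy.spiders import SitemapSpider

class ExampleSitemapSpider(SitemapSpider):
    name = "example_sitemap"
    sitemap_urls = ["http://www.example.com/sitemap.xml"]

    def parse(self, response):
        # Every URL listed in the sitemap arrives here as a response.
        yield {"url": response.url}
```
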
2 votes, 1 answer

Recursive web crawler with BeautifulSoup

I am trying to recursively crawl a Wikipedia URL for all English article links. I want to perform a depth-first traversal of depth n, but for some reason my code is not recursing on every pass. Any idea why? def crawler(url, depth): if depth == 0: …
user42967
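
A hedged sketch of the depth-limited recursion described in the question; the usual pitfall is forgetting to call the function again on the extracted links or to decrement the depth. Only /wiki/ article links are followed, and the seed URL is an example.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawler(url, depth, seen=None):
    if seen is None:
        seen = set()
    if depth == 0 or url in seen:
        return
    seen.add(url)
    print(url)
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('a[href^="/wiki/"]'):
        if ":" not in a["href"]:                  # skip File:, Category:, etc.
            crawler(urljoin(url, a["href"]), depth - 1, seen)  # recurse deeper

crawler("https://en.wikipedia.org/wiki/Web_crawler", depth=2)
```
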
2 votes, 1 answer

StormCrawler Maven packaging error

I am trying to set up and run StormCrawler, following the blog post at http://digitalpebble.blogspot.co.uk/2017/04/crawl-dynamic-content-with-selenium-and.html. The set of resources and configuration for StormCrawler is on my computer in…
Deividas Duda
2 votes, 0 answers

Scrapy KeyError: title

I am new to Python and especially to Scrapy. I wanted to make a spider which gives me all the comments from a Reddit page. It finds the comments, but it does not save them to a .csv file. Here is my spider: import scrapy from scrapy.spiders…
Torb
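
A hedged sketch of one way to get the comments into a CSV file: yield plain dicts (skipping entries that may be missing, which avoids the KeyError) and let Scrapy's feed export write the file. The selectors are placeholders for whatever markup the Reddit page actually uses.

```python
import scrapy

class RedditCommentsSpider(scrapy.Spider):
    name = "reddit_comments"
    start_urls = ["https://old.reddit.com/r/python/comments/example/"]

    def parse(self, response):
        for comment in response.css("div.comment"):
            text = comment.css("div.md ::text").getall()
            if text:                     # skip empty matches to avoid bad rows
                yield {"comment": " ".join(text).strip()}

# Run with:  scrapy runspider reddit_comments.py -o comments.csv
```
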
2 votes, 0 answers

Interactive selenium handler is not called for every request in Nutch

I am trying to use Nutch 1.14 to crawl a website. There are some web pages whose content is loaded through AJAX. I am trying to integrate the interactive Selenium plugin to handle some JS functionality and fetch dynamic data. As per the documentation,…
Rajeev