Questions tagged [web-crawler]

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
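For the maintenance use cases above, a crawler can be as small as a script that fetches a page and verifies every outgoing link. A minimal sketch in Python, assuming the requests and beautifulsoup4 packages and a hypothetical START_URL (not part of the tag wiki):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "https://example.com/"  # hypothetical starting page

def check_links(page_url):
    """Fetch one page and report every link that does not answer with a healthy status."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", href=True):
        target = urljoin(page_url, anchor["href"])  # resolve relative links
        try:
            status = requests.head(target, allow_redirects=True, timeout=10).status_code
        except requests.RequestException as exc:
            print(f"ERROR {target}: {exc}")
            continue
        if status >= 400:
            print(f"{status} {target}")

check_links(START_URL)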

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler revisits a page, it may already have been updated or even deleted.
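A minimal sketch of that seed/frontier loop in Python, with a toy prioritization policy (shallower links first); the package names and seed URL are assumptions, and a real crawler would also honor robots.txt, politeness delays, and revisit policies:

import heapq
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seeds, max_pages=50):
    # Frontier as a priority queue of (priority, url); lower priority = fetched sooner.
    frontier = [(0, url) for url in seeds]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        depth, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                heapq.heappush(frontier, (depth + 1, link))  # grow the crawl frontier
    return visited

# crawl(["https://example.com/"])  # hypothetical seed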

The number of possible crawlable URLs generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This combinatorial explosion creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
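One common mitigation is to canonicalize URLs before they enter the frontier, so that parameter order and presentation-only parameters do not multiply one gallery into dozens of URLs. A sketch in Python; the parameter names below are hypothetical stand-ins for the gallery example:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"sort", "thumbsize", "format", "hide_user_content"}  # hypothetical names

def canonicalize(url):
    parts = urlsplit(url)
    # Drop presentation-only parameters, sort the rest for a stable order, discard fragments.
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), ""))

print(canonicalize("http://example.com/gallery?format=jpg&sort=date&album=42"))
# -> http://example.com/gallery?album=42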

9683 questions
12 votes, 3 answers

Building a geolocation photo index - crawling the web or relying on an existing API?

I'm developing a geo-location service which requires a photo per POI, and I'm trying to figure out how to match the right photo to a given location. I'm looking for an image that will give an overview of the location rather than some arbitrary…
Shlomi Schwartz

12 votes, 4 answers

Using one Scrapy spider for several websites

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- this will instead be configurable in a GUI. How do I (as simply as possible) create a spider…
Christian Davén

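One commonly used pattern for this (sketched here, not taken from the question's answers) is to keep the domains and URL patterns in a configuration object handed to the spider and to build the crawl rules in __init__ before calling the parent constructor; the config dict and its keys are assumptions:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ConfigurableSpider(CrawlSpider):
    name = "configurable"

    def __init__(self, config=None, *args, **kwargs):
        # `config` would come from the GUI, e.g.
        # {"start_urls": [...], "allowed_domains": [...], "allow": [r"/products/.*"]}
        config = config or {}
        self.start_urls = config.get("start_urls", [])
        self.allowed_domains = config.get("allowed_domains", [])
        self.rules = (
            Rule(LinkExtractor(allow=config.get("allow", ())),
                 callback="parse_item", follow=True),
        )
        super().__init__(*args, **kwargs)  # CrawlSpider compiles self.rules here

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}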
12 votes, 1 answer

Is there a way to get all posts for a given subreddit instead of just the posts newer than one month?

Is there a way to get all posts for a given subreddit instead of just the posts newer than one month? For example, this is the "last" page of posts from the IAmA subreddit that we can get to: http://www.reddit.com/r/IAmA/?count=900&limit=100&after=t3_1k3tm1,…
shengmin

12 votes, 2 answers

How to use Goutte

Issue: I cannot fully understand the Goutte web scraper. Request: Can someone please help me understand, or provide code to help me better understand, how to use the Goutte web scraper? I have read over the README.md. I am looking for more information…
scrfix

12 votes, 1 answer

How to print html source to console with phantomjs

I just downloaded and installed phantomjs on my machine. I copied and pasted the following script into a file called hello.js: var page = require('webpage').create(); var url = 'https://www.google.com' page.onLoadStarted = function () { …
toom

12 votes, 1 answer

Is there a pagination links microdata entry?

There is a microdata vocabulary for breadcrumb links: http://www.data-vocabulary.org/Breadcrumb/ But is there a similar vocabulary for page links, like: [<-] 3 4 5[prev] 6[current] 7[next] 8 9 10 11 [->]
alemjerus

12 votes, 6 answers

.NET Custom Threadpool with separate instances

What is the most recommended .NET custom threadpool that can have separate instances, i.e. more than one threadpool per application? I need an unlimited queue size (building a crawler), and need to run a separate threadpool in parallel for each site…
Roey

11 votes, 3 answers

How does Cloudflare differentiate Selenium and Requests traffic?

Context: I am currently attempting to build a small-scale bot using the Selenium and Requests modules in Python. However, the webpage I want to interact with is running behind Cloudflare. My Python script is running over Tor using the stem module. My traffic…
ku8zi

11 votes, 4 answers

golang force net/http client to use IPv4 / IPv6

I'm using go1.11 net/http and want to detect if a domain is IPv6-only. What did you do? I created my own DialContext because I want to detect if a domain is IPv6-only. Code below: package main import ( "errors" "fmt" "net" "net/http" …
fang jinxu

11 votes, 2 answers

how to crawl all the internal url's of a website using crawler?

I wanted to use a crawler in node.js to crawl all the links in a website (internal links) and get the title of each page. I saw this plugin on npm, crawler; if I check the docs, there is the following example: var Crawler = require("crawler"); var c…
Alexander Solonik

11 votes, 2 answers

Scrapy get all links from any website

I have the following code for a web crawler in Python 3: import requests from bs4 import BeautifulSoup import re def get_links(link): return_links = [] r = requests.get(link) soup = BeautifulSoup(r.content, "lxml") if…
Brandon Skerritt

11 votes, 1 answer

Scrapy - Understanding CrawlSpider and LinkExtractor

So I'm trying to use CrawlSpider and understand the following example in the Scrapy Docs: import scrapy from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor class MySpider(CrawlSpider): name =…
ocean800

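For reference, a short annotated sketch of how the Rule/LinkExtractor pieces interact (the domain and URL patterns are placeholders): each Rule tells the CrawlSpider which links to extract from every crawled page, whether to follow them, and which callback should parse the responses.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]     # placeholder domain
    start_urls = ["https://example.com/"]

    rules = (
        # Follow category pages without parsing them (no callback, so follow defaults to True).
        Rule(LinkExtractor(allow=r"/category/")),
        # Parse item pages; when a callback is given, follow defaults to False.
        Rule(LinkExtractor(allow=r"/item/\d+"), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}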
11 votes, 6 answers

Any Good Open Source Web Crawling Framework in C#

I am building a shopping comparison engine and I need to build a crawling engine to perform the daily data collection process. I have decided to build the crawler in C#. I have a lot of bad experience with the HttpWebRequest/HttpWebResponse classes and…
Sumit Ghosh

11 votes, 4 answers

Bingpreview invalidates one time links in email

It seems that Outlook.com uses the BingPreview crawler to crawl links in emails. But the one-time links are marked as used/expired after opening the email and before the user gets the chance to use them. I tried to add a rel="nofollow" in the link but…
colas

11 votes, 2 answers

How to disable robots.txt when you launch scrapy shell?

I use the Scrapy shell without problems on several websites, but I run into problems when robots.txt does not allow access to a site. How can I disable robots.txt detection in Scrapy (ignore its existence)? Thank you in advance. I'm not talking…
DARDAR SAAD
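Two commonly used ways to do this: pass the setting on the command line for a single session, e.g. scrapy shell -s ROBOTSTXT_OBEY=False <url>, or turn it off project-wide in settings.py (a minimal sketch; this disables the robots.txt middleware for the whole project):

# settings.py
ROBOTSTXT_OBEY = False  # Scrapy will no longer download or obey robots.txt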