Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
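
A minimal sketch of that seed-and-frontier loop, assuming the requests and BeautifulSoup libraries purely for illustration:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from a list of seed URLs."""
    frontier = deque(seeds)   # URLs still to visit (the crawl frontier)
    visited = set()           # URLs already attempted

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue          # skip unreachable pages

        # Extract hyperlinks and push unseen ones onto the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in visited:
                frontier.append(link)

    return visited

# Example: crawl(["https://example.com/"])
```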

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, earlier pages might have already been updated or even deleted.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
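
One common mitigation is URL canonicalization: normalize the query string so that URLs differing only in such parameters collapse to one key before being added to the frontier. A rough sketch, where the set of parameters treated as insignificant is an assumption for illustration:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Parameters assumed (for illustration) not to change the underlying content.
IGNORED_PARAMS = {"sort", "thumb_size", "format", "show_user_content"}

def canonicalize(url):
    """Return a canonical form of the URL for duplicate detection."""
    parts = urlparse(url)
    # Drop ignored parameters and sort the rest for a stable ordering.
    params = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS
    )
    return urlunparse(parts._replace(query=urlencode(params), fragment=""))

# The 48 gallery variants described above would all map to the same key:
assert canonicalize("https://example.com/gallery?sort=date&thumb_size=large") == \
       canonicalize("https://example.com/gallery?thumb_size=small&sort=name")
```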


9683 questions
21 votes · 5 answers

What are some good Ruby-based web crawlers?

I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby. Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part…
Jordan Dea-Mattson (5,791 rep)
21 votes · 4 answers

Scrapy: HTTP status code is not handled or not allowed?

I want to get the product title, link, and price in the category https://tiki.vn/dien-thoai-may-tinh-bang/c1789, but it fails with "HTTP status code is not handled or not allowed". My file: spiders/tiki.py import scrapy from scrapy.linkextractors import…
gait (331 rep)
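
Scrapy's HttpError middleware drops responses with non-2xx status codes before they reach the spider; a spider can opt in to seeing them via handle_httpstatus_list. A minimal sketch for the question above; the status codes, user agent, and CSS selectors are assumptions for illustration:

```python
import scrapy

class TikiSpider(scrapy.Spider):
    name = "tiki"
    start_urls = ["https://tiki.vn/dien-thoai-may-tinh-bang/c1789"]
    # Tell the HttpError middleware to pass these statuses to the callback
    # instead of silently dropping the responses.
    handle_httpstatus_list = [403, 404, 429]
    # Many sites also reject the default user agent, so overriding it can help.
    custom_settings = {"USER_AGENT": "Mozilla/5.0 (compatible; MyCrawler/1.0)"}

    def parse(self, response):
        if response.status != 200:
            self.logger.warning("Got status %s for %s", response.status, response.url)
            return
        for product in response.css("a.product-item"):   # selector is an assumption
            yield {
                "title": product.css("::attr(title)").get(),
                "link": response.urljoin(product.attrib.get("href", "")),
            }
```
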
21 votes · 3 answers

Writing items to a MySQL database in Scrapy

I am new to Scrapy. I have the spider code class Example_spider(BaseSpider): name = "example" allowed_domains = ["www.example.com"] def start_requests(self): yield self.make_requests_from_url("http://www.example.com/bookstore/new") …
Shiva Krishna Bavandla (25,548 rep)
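
The idiomatic place for database writes in Scrapy is an item pipeline. A sketch using pymysql, where the table name, columns, and connection settings are assumptions:

```python
import pymysql

class MySQLPipeline:
    """Store scraped items in a MySQL table (schema assumed for illustration)."""

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host="localhost", user="scrapy", password="secret", database="bookstore"
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO books (title, url) VALUES (%s, %s)",
            (item.get("title"), item.get("url")),
        )
        self.conn.commit()
        return item

# Enable it in settings.py, e.g.:
# ITEM_PIPELINES = {"myproject.pipelines.MySQLPipeline": 300}
```
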
21 votes · 4 answers

Strategy for how to crawl/index frequently updated webpages?

I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index their frontpage or any of their main pages, then within hours my…
OdieO (6,836 rep)
20 votes · 8 answers

Scrapy - logging to file and stdout simultaneously, with spider names

I've decided to use the Python logging module because the messages generated by Twisted on standard error are too long, and I want INFO-level, meaningful messages such as those generated by the StatsCollector to be written to a separate log file while…
goh (27,631 rep)
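
With newer Scrapy versions, which route all messages through the standard logging module under per-spider logger names, one hedged approach is to attach a file handler to the root logger alongside the console output; the file name and format string below are assumptions:

```python
import logging

# Scrapy logs through the stdlib logging module, and each spider logs under its
# own name, so a handler on the root logger captures both kinds of output
# while console logging continues unchanged.
file_handler = logging.FileHandler("spider.log")
file_handler.setLevel(logging.INFO)
file_handler.setFormatter(
    logging.Formatter("%(asctime)s [%(name)s] %(levelname)s: %(message)s")
)
logging.getLogger().addHandler(file_handler)
```
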
20 votes · 5 answers

An alternative web crawler to Nutch

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is: using Nutch as the web crawler and Solr as the search engine; the front-end and the site logic are coded with…
wassimans (8,382 rep)
20 votes · 10 answers

Language/libraries for downloading & parsing web pages?

What language and libraries are suitable for a script to parse and download small numbers of web resources? For example, some websites publish pseudo-podcasts, but not as proper RSS feeds; they just publish an MP3 file regularly with a web page…
Bennett McElwee (24,740 rep)
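
In Python, a common pairing for this kind of one-off script is requests for fetching and BeautifulSoup for parsing. A sketch that downloads the MP3 links found on a page, with the page URL and filename handling as assumptions:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://example.com/pseudo-podcast/"  # placeholder URL
html = requests.get(page_url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Find every link ending in .mp3 and save it to the current directory.
for anchor in soup.find_all("a", href=True):
    href = anchor["href"]
    if href.lower().endswith(".mp3"):
        mp3_url = urljoin(page_url, href)
        filename = mp3_url.rsplit("/", 1)[-1]
        with open(filename, "wb") as fh:
            fh.write(requests.get(mp3_url, timeout=30).content)
```
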
20 votes · 6 answers

Can I block search crawlers for every site on an Apache web server?

I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really not like it if the staging sites get indexed. Is there a way I can modify my httpd.conf on the staging server to block…
Nick Messick (3,202 rep)
20 votes · 3 answers

How do Scrapy rules work with crawl spider

I'm having a hard time understanding Scrapy's crawl spider rules. I have an example that doesn't work as I would like it to, so it could be one of two things: I don't understand how rules work, or I formed an incorrect regex that prevents me from getting the results I need. OK…
Vy.Iv (829 rep)
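
For reference, rules only take effect on CrawlSpider subclasses, and each Rule wraps a LinkExtractor whose allow pattern is a regex matched against the URL. A minimal sketch, with the domain, patterns, and callback as assumptions:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = "example_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    rules = (
        # Follow category pages without parsing them.
        Rule(LinkExtractor(allow=r"/category/"), follow=True),
        # Parse item pages with the callback below; note the callback must NOT
        # be named `parse`, because CrawlSpider uses `parse` internally.
        Rule(LinkExtractor(allow=r"/item/\d+"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```
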
20 votes · 3 answers

Python-Requests (>= 1.*): How to disable keep-alive?

I'm trying to program a simple web crawler using the Requests module, and I would like to know how to disable its default keep-alive feature. I tried using: s = requests.session() s.config['keep_alive'] = False However, I get an error stating…
Acemad (3,241 rep)
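
In Requests 1.x and later the session config dict is gone; a common workaround is to send a Connection: close header so the connection is not pooled for reuse. A short sketch:

```python
import requests

s = requests.Session()
# Ask the server (and the underlying connection pool) to close the connection
# after each response instead of keeping it alive for reuse.
s.headers["Connection"] = "close"

response = s.get("https://example.com/")
print(response.status_code)
```
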
20 votes · 12 answers

Prevent site data from being crawled and ripped

I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search. What are the measures I can take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried…
yoavf (20,945 rep)
20 votes · 1 answer

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

Below is a sample robots.txt file to allow multiple user agents with multiple crawl delays for each user agent. The Crawl-delay values are for illustration purposes and will be different in a real robots.txt file. I have searched all over the web…
Sammy (877 rep)
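
On the consuming side, Python's urllib.robotparser can report the effective delay per user agent once the file is in place; a sketch where the robots.txt URL and agent names are assumptions:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay for the matching user-agent group,
# or None if that group does not declare one (Python 3.6+).
for agent in ("Googlebot", "bingbot", "*"):
    print(agent, rp.crawl_delay(agent), rp.can_fetch(agent, "https://example.com/"))
```
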
19 votes · 4 answers

Getting value after button click with BeautifulSoup Python

I'm trying to get a value that is given by the website after a click on a button. Here is the website: https://www.4devs.com.br/gerador_de_cpf You can see that there is a button called "Gerar CPF"; this button provides a number that appears after…
user6866656
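
BeautifulSoup only parses the HTML it is given and cannot run the JavaScript behind the button, so one hedged workaround is to drive a real browser with Selenium and read the value after clicking; the element IDs below are guesses for illustration, not the page's confirmed markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.4devs.com.br/gerador_de_cpf")

# Click the generate button and wait for the result element to become visible.
# Both locators are assumptions for illustration.
driver.find_element(By.ID, "bt_gerar_cpf").click()
value = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "texto_cpf"))
).text
print(value)
driver.quit()
```
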
19 votes · 4 answers

How can I handle Javascript in a Perl web crawler?

I would like to crawl a website; the problem is that it's full of JavaScript things, such as buttons, that when pressed do not change the URL but do change the data on the page. Usually I use LWP / Mechanize etc to crawl…
snoofkin (8,725 rep)
19 votes · 3 answers

Is Scrapy single-threaded or multi-threaded?

There are a few concurrency settings in Scrapy, like CONCURRENT_REQUESTS. Does that mean that the Scrapy crawler is multi-threaded? So if I run scrapy crawl my_crawler, will it literally fire multiple simultaneous requests in parallel? I'm asking because…
Gill Bates (14,330 rep)
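
For context, Scrapy runs on Twisted's event loop: one thread multiplexing many non-blocking requests, with concurrency capped by settings rather than by threads. A sketch of the relevant settings, with illustrative values:

```python
# settings.py -- illustrative values; Scrapy issues these requests
# asynchronously from a single thread via the Twisted reactor.
CONCURRENT_REQUESTS = 32            # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
DOWNLOAD_DELAY = 0.25               # politeness delay between requests to a domain
```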