Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
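A minimal sketch of that seed-and-frontier loop, assuming a purely breadth-first policy and the third-party requests and beautifulsoup4 packages (all names below are illustrative):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: seeds go into the frontier, and each visited
    page contributes its hyperlinks back to the frontier."""
    frontier = deque(seeds)   # the crawl frontier
    visited = set()           # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])   # resolve relative links
            if link not in visited:
                frontier.append(link)
    return visited

# crawl(["https://example.com/"])
```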

The large volume of pages on the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler reaches them.
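One common way to express such a download policy is to replace the plain FIFO frontier with a priority queue; a sketch, where the scoring of URLs is left entirely as an assumption:

```python
import heapq
import itertools

class PriorityFrontier:
    """Crawl frontier that pops the highest-priority URL first.
    The score is a placeholder; a real crawler might use link depth,
    estimated change rate, or a PageRank-style importance measure."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal scores

    def push(self, url, score):
        # heapq is a min-heap, so negate the score to pop the best URL first
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

# frontier = PriorityFrontier()
# frontier.push("https://example.com/news", score=0.9)   # changes often
# frontier.push("https://example.com/about", score=0.1)  # rarely changes
# frontier.pop()  # -> "https://example.com/news"
```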

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
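Crawlers often mitigate this by canonicalizing URLs before adding them to the frontier, for example sorting the query string and dropping parameters known to be presentation-only. A sketch, where the set of ignorable parameters is purely hypothetical:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical presentation-only parameters that do not change the content
IGNORED_PARAMS = {"sort", "thumb_size", "format", "show_user_content"}

def canonicalize(url):
    """Return a canonical form of the URL so that the 48 gallery variants
    described above collapse into a single frontier entry."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    query.sort()  # parameter order should not produce distinct URLs
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))

# canonicalize("http://gallery.example/photos?sort=date&thumb_size=small")
# -> "http://gallery.example/photos"
```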


9683 questions
14 votes, 2 answers

Web Crawler - Ignore Robots.txt file?

Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python.
Craig Locke
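For the question above: Mechanize does expose a switch for this, so a minimal sketch might look as follows (whether ignoring robots.txt is appropriate for the target site is another matter):

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # do not fetch or honor robots.txt
response = br.open("http://example.com/")   # placeholder URL
print(response.read()[:200])
```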
14 votes, 2 answers

Replay a Scrapy spider on stored data

I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to…
del
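For the question above, one way to get this replay behavior is Scrapy's built-in HTTP cache, which stores raw responses on disk so later runs can re-parse them offline; a sketch of the relevant settings.py entries, assuming a recent Scrapy version:

```python
# settings.py: cache every response on disk so later runs can be
# "replayed" against the stored raw data instead of re-downloading it.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"        # stored under the project's .scrapy dir
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```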
14 votes, 4 answers

How do I get the destination URL of a shortened URL using Ruby?

How do I take this URL http://t.co/yjgxz5Y and get the destination URL which is http://nickstraffictricks.com/4856_how-to-rank-1-in-google/
Nick
14 votes, 4 answers

How to make Scrapy show user agent per download request in log?

I am learning Scrapy, a web crawling framework. I know I can set USER_AGENT in the settings.py file of a Scrapy project. When I run Scrapy, I can see the USER_AGENT value in the INFO logs. This USER_AGENT gets set in every download request to the…
Alok
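For the question above, a small downloader middleware is one way to log the User-Agent actually attached to each outgoing request; a sketch (the project path and middleware priority are assumptions):

```python
# middlewares.py: log the User-Agent header of every request Scrapy sends.
class LogUserAgentMiddleware:
    def process_request(self, request, spider):
        spider.logger.debug(
            "User-Agent for %s: %s",
            request.url,
            request.headers.get("User-Agent"),
        )
        return None  # let the request continue through the middleware chain

# settings.py (priority chosen so it runs after the User-Agent is set):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.LogUserAgentMiddleware": 900}
```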
14 votes, 5 answers

crawl site that has infinite scrolling using python

I have been doing research, and so far the Python package I plan on using is Scrapy. Now I am trying to find out what a good way is to build a scraper with Scrapy that crawls a site with infinite scrolling. After digging around I…
add-semi-colons
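For the question above, the usual approach is to skip the scrolling altogether and request the paginated endpoint that the page's JavaScript calls; a sketch with a hypothetical JSON API, assuming a recent Scrapy version:

```python
import scrapy


class InfiniteScrollSpider(scrapy.Spider):
    """Instead of scrolling, request the paginated JSON endpoint directly
    (the endpoint, fields, and pagination flag below are hypothetical)."""
    name = "infinite_scroll"
    api_url = "https://example.com/api/items?page={page}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1), cb_kwargs={"page": 1})

    def parse(self, response, page):
        data = response.json()
        for item in data.get("items", []):
            yield item                       # emit each scraped item
        if data.get("has_more"):             # hypothetical pagination flag
            next_page = page + 1
            yield scrapy.Request(self.api_url.format(page=next_page),
                                 cb_kwargs={"page": next_page})
```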
14 votes, 3 answers

crawl links of sitemap.xml through wget command

I am trying to crawl all links of a sitemap.xml to re-cache a website, but the recursive option of wget does not work; I only get the response "Remote file exists but does not contain any link -- not retrieving." Yet the sitemap.xml is certainly full of…
dohomi
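For the question above: wget's recursive mode follows HTML links, while a sitemap lists its URLs in XML <loc> elements, so one workaround is to extract those entries and fetch each URL directly; a Python sketch (the sitemap URL is a placeholder):

```python
import xml.etree.ElementTree as ET

import requests

# Sitemaps use this XML namespace for their <url><loc> entries.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap = requests.get("https://example.com/sitemap.xml", timeout=10)
root = ET.fromstring(sitemap.content)
for loc in root.findall(".//sm:loc", NS):
    # Fetch each listed page; the response is discarded, the cache is warmed.
    requests.get(loc.text.strip(), timeout=10)
```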
14 votes, 0 answers

Why is Google not using a headless browser to crawl client-side content?

I'm aware of the steps it takes to make a client-side website crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?hl=nl I just wonder, why isn't Google just integrating a headless browser into their crawlers to save…
Christoph
14 votes, 1 answer

Does the URL order matter in an XML sitemap?

For search engines and website crawlers, does the URL order matter in an XML sitemap? Currently, when the sitemap is generated, I order the website URLs sequentially using a unique id in the database. Should I order the URLs in date…
stukelly
14 votes, 5 answers

Does anybody know a good extendable open source web crawler?

The crawler needs to have an extendable architecture to allow changing the internal process, like implementing new steps (pre-parser, parser, etc...) I found the Heritrix Project (http://crawler.archive.org/). But there are other nice projects like…
Zanoni
13 votes, 1 answer

Why is Facebook flooding my site?

Every hour and a half I'm getting a flood of requests from http://www.facebook.com/externalhit_uatext.php. I know what these requests should mean, but this behavior is very odd. On a regular basis (approximately every 1.5 hours), I'm getting dozens of…
Leo Germani
13 votes, 3 answers

Is Erlang the right choice for a webcrawler?

I am planning to write a web crawler for an NLP project that reads in the thread structure of a forum at a specific interval and parses each thread with new content. Via regular expressions, the author, the date and the content of new posts…
Thomas
13 votes, 1 answer

dynamic start_urls in scrapy

I'm using Scrapy to crawl multiple pages on a site. The variable start_urls is used to define the pages to be crawled. I would initially start with the 1st page, thus defining start_urls = [1st page] in the file example_spider.py. Upon getting more info…
Harry
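For the question above, a common pattern is to build the start URLs at runtime by overriding start_requests() instead of hard-coding start_urls; a sketch, where the URL pattern and the spider argument are assumptions:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def __init__(self, pages="1", *args, **kwargs):
        # Run with e.g.: scrapy crawl example -a pages=1,2,3
        super().__init__(*args, **kwargs)
        self.pages = pages.split(",")

    def start_requests(self):
        # Build the start URLs at runtime instead of a fixed start_urls list
        for page in self.pages:
            yield scrapy.Request(f"https://example.com/page/{page}",
                                 callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```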
13 votes, 2 answers

What is the best Open Source Web Crawler Tool written in Java?

What is the best open source web crawler tool written in Java?
cuneytykaya
13 votes, 2 answers

What is the "Bytespider" user agent?

Sample user agent String: Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.1511.1269 Mobile Safari/537.36; Bytespider Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)…
Gokula Kannan
13 votes, 3 answers

Is there any JavaScript web crawler framework?

Is there any JavaScript web crawler framework?
saleh Hosseinkahni