Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
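
As a small illustration of the maintenance use case, a link checker can be sketched in a few lines of Python (the page URL is a placeholder; requests and BeautifulSoup are assumed to be installed):

```python
# Minimal link-checker sketch: fetch one page, extract its links,
# and report any that do not respond with a 2xx/3xx status.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def check_links(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(page_url, a["href"])   # resolve relative links
        try:
            status = requests.head(target, timeout=10, allow_redirects=True).status_code
        except requests.RequestException:
            status = None
        if status is None or status >= 400:
            print("broken:", target, status)

if __name__ == "__main__":
    check_links("https://example.com")          # placeholder page to check
```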

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
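
A minimal sketch of that seed/frontier loop in Python (the seed URL, page limit, and politeness delay are illustrative assumptions, not part of any particular crawler):

```python
# Toy breadth-first crawler: start from seed URLs, pull links out of each
# fetched page, and push unseen ones onto the crawl frontier.
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=50, delay=1.0):
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued (deduplication policy)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        fetched += 1
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(delay)     # politeness policy: wait between requests
    return seen

if __name__ == "__main__":
    print(crawl(["https://example.com"]))       # placeholder seed
```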

The large volume of the Web implies that a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. The Web's high rate of change implies that, by the time the crawler revisits a page, it may already have been updated or even deleted.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
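
One common mitigation is URL canonicalization: drop query parameters that only change presentation so that the 4 × 3 × 2 × 2 = 48 gallery variants collapse to a single key. A sketch (the parameter names are made up for the gallery example):

```python
# Normalise URLs by removing presentation-only query parameters,
# so many URL variants of the same content map to one canonical form.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

PRESENTATION_PARAMS = {"sort", "thumb", "format", "show_user_content"}   # hypothetical names

def canonical_url(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in PRESENTATION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

# Both variants below collapse to the same canonical URL, so it is fetched once.
print(canonical_url("https://example.com/gallery?album=7&sort=date&thumb=small&format=jpg"))
print(canonical_url("https://example.com/gallery?album=7&sort=name&thumb=large&format=png"))
```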

9683 questions
10 votes, 2 answers

Reverse search an image in Yandex Images using Python

I'm interested in automating reverse image search. Yandex in particular is great for busting catfishes, even better than Google Images. So, consider this Python code: import requests import webbrowser try: filePath =…
Platon Makovsky
  • 275
  • 3
  • 13
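
For the URL-based variant of this, a sketch in the spirit of the question's requests/webbrowser approach might look like the following. The rpt=imageview query parameter is an assumption about Yandex's current web interface (not an official API) and may change; uploading a local file would need a separate multipart POST.

```python
# Open a Yandex reverse-image search for an image that is already online.
# Assumes Yandex still accepts the url/rpt=imageview parameters.
import webbrowser
from urllib.parse import urlencode

def reverse_search(image_url):
    query = urlencode({"rpt": "imageview", "url": image_url})
    webbrowser.open("https://yandex.com/images/search?" + query)

reverse_search("https://example.com/photo.jpg")   # placeholder image URL
```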
10 votes, 4 answers

Is there a hashing algorithm that is tolerant of minor differences?

I'm doing some web crawling type stuff where I'm looking for certain terms in webpages and finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something…
Jason Baker
  • 192,085
  • 135
  • 376
  • 510
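
The usual answer here is a locality-sensitive scheme such as simhash, where near-duplicate documents produce fingerprints that differ in only a few bits. A self-contained sketch (word-level tokens, 64-bit fingerprints chosen for illustration):

```python
# Minimal simhash sketch: near-identical texts yield fingerprints with a
# small Hamming distance, unlike ordinary cryptographic hashes.
import hashlib

def simhash(text, bits=64):
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))   # small distance for near-duplicate text
```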
10 votes, 2 answers

Trying to get Scrapy into a project to run Crawl command

I'm new to Python and Scrapy and I'm walking through the Scrapy tutorial. I've been able to create my project by using the DOS command prompt and typing: scrapy startproject dmoz The tutorial later refers to the crawl command: scrapy crawl dmoz.org But…
Adam Smith
  • 103
  • 1
  • 1
  • 5
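
A common stumbling block in that tutorial is that `scrapy crawl` takes the spider's `name` attribute (run from inside the project directory), not the site's domain. A minimal spider sketch, with placeholder URLs and selectors:

```python
# dmoz_spider.py inside the project's spiders/ package.
# The crawl command matches the `name` attribute below:  scrapy crawl dmoz
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                                   # <- what `scrapy crawl dmoz` refers to
    start_urls = ["https://example.com/directory"]  # placeholder start page

    def parse(self, response):
        for title in response.css("a::text").getall():
            yield {"title": title}
```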
10 votes, 1 answer

Is there CURRENTLY any way to fetch Instagram user media without authentication?

Until recently there were several ways to retrieve Instagram user media without needing API authentication, but apparently the website has shut them all down. Some of the old…
Moradnejad
  • 3,466
  • 2
  • 30
  • 52
10 votes, 1 answer

scrapyd-client command not found

I've just installed scrapyd-client (1.1.0) in a virtualenv and can run the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client', the terminal says: command not found: scrapyd-client. According to the readme…
dropax
  • 125
  • 1
  • 8
10 votes, 2 answers

How can I use scrapy shell with a URL and basic auth credentials?

I want to use scrapy shell to test response data for a URL which requires basic auth credentials. I tried to check the scrapy shell documentation but I couldn't find it there. I tried scrapy shell 'http://user:pwd@abc.com' but it didn't work. Does…
Rohanil
  • 1,717
  • 5
  • 22
  • 47
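
One approach that is often suggested is to build the Request yourself inside the shell and attach an Authorization header (w3lib ships alongside Scrapy; the URL and credentials below are placeholders):

```python
# Inside `scrapy shell` (started with no URL): build an authenticated
# request by hand and fetch it, instead of embedding user:pwd in the URL.
from scrapy import Request
from w3lib.http import basic_auth_header

req = Request(
    "https://example.com/protected",                           # placeholder URL
    headers={"Authorization": basic_auth_header("user", "pwd")},
)
fetch(req)          # shell helper: rebinds response, request, etc.
response.status     # inspect the authenticated response
```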
10 votes, 6 answers

How to check if the content of a webpage has changed?

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, and otherwise wait a bit and check it later. I'm thinking of comparing hashes; the problem with this is that if the page has changed by a single byte or character,…
Savad KP
  • 1,625
  • 3
  • 28
  • 40
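
A simple polling sketch, assuming an exact-match hash is enough (the URL and interval are placeholders; for tolerance to trivial changes, see the simhash question above):

```python
# Poll a page periodically and report when its content hash changes.
import hashlib
import time

import requests

def watch(url, interval=300):
    last = None
    while True:
        body = requests.get(url, timeout=10).content
        digest = hashlib.sha256(body).hexdigest()
        if last is not None and digest != last:
            print("page changed at", time.ctime())
        last = digest
        time.sleep(interval)

# watch("https://example.com")   # runs until interrupted with Ctrl+C
```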
10 votes, 2 answers

NodeJS x-ray web-scraper: how to follow links and get content from sub page

So I am trying to scrape some content with the node.js x-ray scraping framework. While I can get the content from a single page, I can't get my head around how to follow links and get content from a sub-page in one go. There is a sample on the x-ray GitHub…
Ales Maticic
  • 1,895
  • 3
  • 13
  • 27
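
x-ray itself is a Node.js library, so a direct example would be JavaScript; as a language-neutral illustration of the same follow-the-link-then-scrape pattern, here is how it is commonly expressed in Scrapy (URLs and selectors are hypothetical):

```python
# Follow each listing link and pull detail-page content in one crawl.
import scrapy

class ListingSpider(scrapy.Spider):
    name = "listing"
    start_urls = ["https://example.com/listing"]   # placeholder listing page

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            # Queue the sub-page; parse_item receives its response.
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "body": response.css("div.content ::text").getall(),
        }
```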
10 votes, 2 answers

scrapy crawler caught exception reading instance data

I am new to Python and want to use Scrapy to build a web crawler. I went through the tutorial at http://blog.siliconstraits.vn/building-web-crawler-scrapy/. The spider code looks like the following: from scrapy.spider import BaseSpider from scrapy.selector…
printemp
  • 869
  • 1
  • 10
  • 33
10 votes, 1 answer

Scrapy delay request

Every time I run my code my IP gets banned. I need help delaying each request by 10 seconds. I've tried to set DOWNLOAD_DELAY in the code but it gives no results. Any help is appreciated. # item class included here class…
Arkan Kalu
  • 403
  • 2
  • 4
  • 16
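
Per-spider settings are one place where DOWNLOAD_DELAY does take effect; a sketch (the 10-second delay mirrors the question, the other settings are optional extras):

```python
# Slow the crawl down to roughly one request every 10 seconds.
# These settings can also live in the project's settings.py instead.
import scrapy

class SlowSpider(scrapy.Spider):
    name = "slow"
    start_urls = ["https://example.com"]       # placeholder
    custom_settings = {
        "DOWNLOAD_DELAY": 10,                  # base delay between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,      # jitter the delay (0.5x - 1.5x)
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,   # no parallel hits on one domain
    }

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```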
10 votes, 1 answer

How to run different Apache Nutch jobs in parallel

I am using Nutch 2.3. All jobs run one after the other, i.e. first the generator, then fetch, parse, index, etc. I want to run some jobs simultaneously. I know some jobs cannot run in parallel, but others can, e.g. the parse, dbupdate and index jobs should be run with…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
10 votes, 1 answer

How to resume wget mirroring a website?

I use wget to download an entire website. I used the following command (in Windows 7): wget ^ --recursive ^ -A "*thread*, *label*" ^ --no-clobber ^ --page-requisites ^ --html-extension ^ --domains example.com ^ --random-wait ^ --no-parent ^ …
10 votes, 5 answers

Can't set Host in cURL PHP

I am unable to set the host in cURL. It still shows as localhost if I use the following code: function wget($url) { $agent= 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0.1'; $curlHeaders =…
dharanbro
  • 1,327
  • 4
  • 17
  • 40
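
The original question is PHP/cURL; the same idea, overriding the Host header so a virtual host responds even when you connect to localhost or an IP, looks like this in Python requests (shown only to keep the examples on this page in one language; the host names are placeholders):

```python
# Connect to one address but present a different Host header,
# e.g. to reach a name-based virtual host via localhost or a raw IP.
import requests

response = requests.get(
    "http://127.0.0.1/",                      # where the TCP connection goes
    headers={
        "Host": "www.example.com",            # which vhost the web server serves
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    },
    timeout=10,
)
print(response.status_code)
```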
10 votes, 7 answers

Crawl specific pages and data and make it searchable

Important note: the questions below aren't meant to break ANY data copyrights. All crawled and saved data is being linked directly to the source. For a client I'm gathering information for building a search engine/web spider combination. I do have…
Joshua - Pendo
  • 4,331
  • 6
  • 37
  • 51
10 votes, 1 answer

How to prevent Scrapy from URL encoding request URLs

I would like Scrapy to not URL encode my Requests. I see that scrapy.http.Request is importing scrapy.utils.url which imports w3lib.url which contains the variable _ALWAYS_SAFE_BYTES. I just need to add a set of characters to _ALWAYS_SAFE_BYTES but…
flyingtriangle
  • 103
  • 1
  • 5