Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code, or to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
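
As an illustration, here is a minimal Python sketch of that seed-and-frontier loop, assuming the third-party requests and beautifulsoup4 packages; the function name, page limit, and timeout are illustrative choices rather than any standard interface:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    """Visit URLs breadth-first, adding discovered links to the frontier."""
    frontier = deque(seeds)  # the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])  # resolve relative links
            if link not in visited:
                frontier.append(link)
    return visited

A real crawler would add the "set of policies" mentioned above on top of this loop: robots.txt checks, politeness delays, URL canonicalization, and prioritization of the frontier.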

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that, by the time the crawler revisits a page, it might already have been updated or even deleted.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
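
To make the arithmetic concrete, a tiny Python sketch that enumerates those hypothetical gallery parameters and counts the resulting URLs:

from itertools import product
from urllib.parse import urlencode

# The four hypothetical gallery parameters from the example above.
sorts = ["date", "name", "size", "rating"]   # 4 ways to sort
thumbs = ["small", "medium", "large"]        # 3 thumbnail sizes
formats = ["jpg", "png"]                     # 2 file formats
user_content = ["on", "off"]                 # toggle user-provided content

urls = [
    "/gallery?" + urlencode({"sort": s, "thumb": t, "fmt": f, "user": u})
    for s, t, f, u in product(sorts, thumbs, formats, user_content)
]
print(len(urls))  # 4 * 3 * 2 * 2 = 48 distinct URLs for the same content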

9683 questions
2
votes
2 answers

Anyone know of an open source web spider?

I'm looking for a web spider that will collect all the links it sees, save them to a file, and then index them after finishing the ones it has already indexed. It doesn't need a pretty UI or really anything, as long as it can jump from website…
Noah R
  • 5,287
  • 21
  • 56
  • 75
2
votes
1 answer

Puppeteer proxy on Heroku

I am looking for a solution to anonymize a Puppeteer web crawler deployed on Heroku. Locally I have Tor running and re-routing traffic through 127.0.0.1 on port 9050. I cannot reproduce this on my Heroku app after…
Pimmesz
  • 335
  • 2
  • 8
  • 29
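
Not an answer to the Heroku specifics, but as a hedged illustration of the underlying idea: Chromium-based browsers (and hence Puppeteer) accept a --proxy-server launch flag, and the same Tor routing can be sketched in Python with requests plus the requests[socks] extra (PySocks). The check.torproject.org URL is just a convenient way to verify that traffic exits through Tor:

import requests

# Tor's default SOCKS port; requires the requests[socks] extra (PySocks).
proxies = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h resolves DNS through Tor
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get("https://check.torproject.org/", proxies=proxies, timeout=30)
print(resp.status_code)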
2
votes
0 answers

Python (BeautifulSoup) returning "None" for existing HTML while crawling

I simply want to get the HTML of the search bar of the https://www.daraz.com.pk website. I wrote some code and tried it on "https://www.amazon.com", "https://www.alibaba.com", "https://www.goto.com.pk" and many others; it works fine, but it's not working…
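
A hedged sketch of the usual first debugging step in Python: send a browser-like User-Agent and check whether the element exists in the static HTML at all. If find() still returns None, the element is probably rendered by JavaScript and needs a browser-based tool such as Selenium. The selector and header value below are assumptions, not taken from the question:

import requests
from bs4 import BeautifulSoup

# Some sites serve different (or JS-rendered) markup without a browser-like
# User-Agent, in which case find() returns None for elements that do exist
# in the browser's DOM. The header value is illustrative.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

resp = requests.get("https://www.daraz.com.pk", headers=headers, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

search_bar = soup.find("input", {"type": "search"})  # hypothetical selector
if search_bar is None:
    print("Not in the static HTML; likely rendered by JavaScript")
else:
    print(search_bar)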
2
votes
1 answer

Solr query string not working for full text searches

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries, as shown. However, when I replace the q argument with, say…
Ajay H
  • 794
  • 2
  • 11
  • 28
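
For reference, a query against the tutorial's gettingstarted collection can be sketched in Python against Solr's stock /select handler; the host and port are Solr's defaults, and the field name in q is a placeholder:

import requests

# Default local Solr address from the "gettingstarted" tutorial; adjust as needed.
solr_url = "http://localhost:8983/solr/gettingstarted/select"

params = {
    "q": "title:solr",  # hypothetical field:value query
    "rows": 10,
    "wt": "json",
}
resp = requests.get(solr_url, params=params, timeout=10)
docs = resp.json()["response"]["docs"]
print(len(docs))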
2
votes
1 answer

Will my pages be crawled by Google as they are Markdown files asynchronously interpreted and injected into the DOM?

I am planning to start a blog, so I created my own Laravel website. My posts are Markdown files with the .md extension. When a user visits a post, e.g. example.com/how-to-create-a-webiste, the Markdown file is fetched and parsed to generate HTML content…
Raj
  • 1,928
  • 3
  • 29
  • 53
2
votes
4 answers

How to get all td[3] tags from the tr tags with Selenium XPath in Python

I have a webpage HTML like this: …
iman_sh77
  • 77
  • 1
  • 11
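
A hedged sketch of one way to do this with Selenium in Python: the XPath //tr/td[3] selects the third td child of every tr. The URL and driver choice are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/table-page")  # hypothetical URL

# //tr/td[3] selects the third <td> child of every <tr>.
cells = driver.find_elements(By.XPATH, "//tr/td[3]")
for cell in cells:
    print(cell.text)

driver.quit()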
2
votes
1 answer

Crawling a table with Scrapy; the site has unusual HTML code

First post. I appreciate any guidance and can't wait to give back to the community. I am trying to make a crawler using Scrapy to collect data from this table: http://www.wikicfp.com/cfp/call?conference=machine%20learning Specifically, the…
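
A minimal Scrapy sketch for pulling rows out of that table; the XPath selectors are assumptions and will likely need adapting to the page's irregular markup:

import scrapy

class CfpSpider(scrapy.Spider):
    name = "cfp"
    start_urls = ["http://www.wikicfp.com/cfp/call?conference=machine%20learning"]

    def parse(self, response):
        # Iterate over table rows and yield the text of each cell; the
        # selectors are illustrative, not tuned to this page's actual layout.
        for row in response.xpath("//table//tr"):
            cells = row.xpath("./td//text()").getall()
            if cells:
                yield {"cells": [c.strip() for c in cells if c.strip()]}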
2
votes
2 answers

(Python, Selenium) Is it possible to get text list only if the attribute meets criteria?

Not sure I made my point in the title. Let's see the source code first.

Jeong In Kim
  • 373
  • 2
  • 12
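
Two common approaches, sketched with Selenium in Python: filter in the XPath itself with an attribute predicate, or fetch all elements and filter on get_attribute. The tag and attribute names are hypothetical:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical URL

# Option 1: filter in XPath with an attribute predicate.
active_texts = [el.text for el in driver.find_elements(
    By.XPATH, "//li[@data-state='active']")]  # hypothetical attribute

# Option 2: fetch all elements, then filter on get_attribute in Python.
filtered_texts = [el.text for el in driver.find_elements(By.TAG_NAME, "li")
                  if el.get_attribute("data-state") == "active"]

driver.quit()
print(active_texts, filtered_texts)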
2
votes
0 answers

Indexing and crawling using Nutch with Solr

I'm a newbie to Nutch and Solr. I'm trying to index and crawl a single website using Nutch with Solr, but I'm getting this error, and I don't know what the exact error is. Can anyone help me with this? Thanks in advance. Segment dir is complete:…
Vinod kumar
  • 87
  • 1
  • 12
2
votes
2 answers

Failing at downloading an image with "urllib.request.urlretrieve" in Python

If possible, point out the solution as well. My code:

import random
import urllib.request

def download_web_image(url):
    name = random.randrange(1, 1000)
    fullname = str(name) + ".jpg"
    urllib.request.urlretrieve(url,…
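
For comparison, a sketch of a working version: a frequent cause of urlretrieve failures is the server rejecting urllib's default User-Agent with HTTP 403, so this variant sends a browser-like header instead (the header value and example URL are illustrative, and the original call is truncated above, so this is an assumption about the intent):

import random
import urllib.request

def download_web_image(url):
    # Pick a random numeric filename, as in the question's code.
    name = random.randrange(1, 1000)
    fullname = str(name) + ".jpg"
    # Send a browser-like User-Agent; some image hosts reject urllib's default.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response, open(fullname, "wb") as f:
        f.write(response.read())
    return fullname

# download_web_image("https://example.com/some-image.jpg")  # hypothetical URL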
2
votes
0 answers

Scrapy timeout after a while

I am crawling text from https://www.dailynews.co.th, and here is my question. My spider worked almost perfectly at first and crawled about 4000 pages: 2018-09-28 20:05:00 [scrapy.extensions.logstats] INFO: Crawled 4161 pages (at 0 pages/min),…
CKLu
  • 68
  • 7
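
A few Scrapy settings (in settings.py) that are commonly tuned when a long crawl stalls at 0 pages/min; the values below are illustrative starting points, not a known fix for this particular site:

# Illustrative Scrapy settings for a crawl that stalls mid-run.
DOWNLOAD_TIMEOUT = 30        # give up on slow responses instead of hanging
RETRY_ENABLED = True
RETRY_TIMES = 2              # retry transient failures a couple of times
CONCURRENT_REQUESTS = 8      # lower concurrency to avoid being throttled
AUTOTHROTTLE_ENABLED = True  # back off automatically when the site slows down
DOWNLOAD_DELAY = 0.5         # be polite; some sites rate-limit aggressive crawlers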
2
votes
1 answer

How do I set the content of a meta tag from Node or Express?

I'm building a search engine for a real estate company called Signature REP. I'm trying to use Facebook's Open Graph API following the Sharing for Webmasters guide, and what I'm trying to do is get this image for this website to show up as the image…
ihodonald
  • 745
  • 1
  • 12
  • 27
2
votes
1 answer

Issue Crawling Amazon, Element Cannot Be Scrolled into View

I'm having an issue crawling pages on Amazon. I've tried executing a JS script, action chains, and explicit waits. Nothing seems to work; everything throws one exception or error or another. Base script: ff =…
oldboy
  • 5,729
  • 6
  • 38
  • 86
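
One commonly suggested workaround, sketched with Selenium in Python: scroll the element into view via execute_script before interacting with it. The locator is a placeholder, not taken from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.amazon.com")  # page from the question

element = driver.find_element(By.ID, "some-element-id")  # hypothetical locator

# Scroll the element into view with JavaScript before clicking; this
# sidesteps "element cannot be scrolled into view" errors in many cases.
driver.execute_script("arguments[0].scrollIntoView(true);", element)
element.click()

driver.quit()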
2
votes
4 answers

Getting Wikipedia abstracts only

I have searched around but not gotten much help. Here's my problem: I want to start from a portal page on Wikipedia, say Computer_science, and go to its category pages. There are some pages in that category, and there are links to subcategories. I will…
Sanjeev Satheesh
  • 424
  • 5
  • 17
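
For the abstracts themselves, the MediaWiki API can return just the lead section of an article via the TextExtracts extension (prop=extracts with exintro), which avoids parsing page HTML. A Python sketch, using Computer_science as the example title:

import requests

# Ask the MediaWiki API for only the intro section, as plain text.
params = {
    "action": "query",
    "prop": "extracts",
    "exintro": 1,        # only the lead ("abstract") section
    "explaintext": 1,    # plain text instead of HTML
    "titles": "Computer_science",
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
pages = resp.json()["query"]["pages"]
for page in pages.values():
    print(page["extract"][:200])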
2
votes
1 answer

How to find the current start_url in Scrapy CrawlSpider?

When running Scrapy from my own script, which loads URLs from a database and follows all internal links on those websites, I hit a snag. I need to know which start_url is currently in use, as I have to maintain consistency with a SQL database. But:…
junkmaster
  • 141
  • 1
  • 11
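
One common pattern, sketched here: tag each request with its originating start_url in the request meta so it propagates across followed links. The database loader is a hypothetical placeholder:

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        for url in self.load_urls_from_db():  # hypothetical DB loader
            # Tag every request with its originating start URL.
            yield scrapy.Request(url, meta={"start_url": url})

    def parse(self, response):
        start_url = response.meta["start_url"]  # known for every response
        for href in response.css("a::attr(href)").getall():
            yield response.follow(
                href, callback=self.parse,
                meta={"start_url": start_url})  # propagate the tag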