Questions tagged [web-crawler]

A web-crawler (also known as a web spider) traverses the webpages of the internet by following the hyperlinks contained within each page. The crawler is usually given an initial seed of URLs from which to initialize its crawl.

This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a website, such as checking links or validating HTML code, or to gather specific types of information from web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
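The seed-and-frontier loop described above can be sketched as a minimal breadth-first crawl. Here `fetch_links` is a hypothetical stand-in for fetching a page and extracting its hyperlinks, so the example runs without any network access; the site graph is invented for illustration:

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: the seeds initialize the frontier; each visited
    page's hyperlinks are resolved to absolute URLs and appended, and a
    'seen' set keeps any URL from being enqueued twice."""
    frontier = deque(seeds)   # the crawl frontier
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):          # links found on the page
            absolute = urljoin(url, link)      # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited

# Hypothetical site graph standing in for real HTTP fetches:
pages = {
    "http://example.com/":  ["/a", "/b"],
    "http://example.com/a": ["/b", "/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["/"],
}
order = crawl(["http://example.com/"], lambda u: pages.get(u, []))
```

A real crawler would replace `fetch_links` with an HTTP fetch plus HTML link extraction, and add the policies mentioned above (politeness delays, robots.txt, prioritization).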

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages might already have been updated or even deleted by the time the crawler visits them.
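One common way to prioritize downloads is to keep the frontier as a priority queue rather than a plain FIFO. A minimal sketch, assuming a hypothetical scoring policy where a lower score means "fetch sooner" (e.g. pages expected to change often):

```python
import heapq

class PriorityFrontier:
    """Crawl frontier that yields URLs in priority order (lowest score
    first), so pages deemed more important or more volatile are fetched
    before the rest. The scores used below are invented placeholders."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, score):
        # Enqueue each URL at most once, keyed by its priority score.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (score, url))

    def pop(self):
        # Return the highest-priority (lowest-score) URL.
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)

frontier = PriorityFrontier()
frontier.add("http://example.com/archive", 2.0)  # rarely changes: low priority
frontier.add("http://example.com/news", 0.1)     # changes often: high priority
first = frontier.pop()
```

How the score is computed (change frequency, link popularity, depth from the seed) is exactly the "set of policies" the description refers to.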

More on Wikipedia

98 questions
-2
votes
1 answer

How to set the correct HTML file as the homepage?

A search for my domain shows the default homepage along with an HTML file that my hosting service, using Plesk 9.5.4, provides as a placeholder. I have since deleted the old index file and added a new index.html file, but the Google result still shows this…
-2
votes
1 answer

What's the best server-side language for programming a web crawler?

I would like to ask which language (ASP.NET / Ruby / CGI / Perl / Python / ColdFusion…) would be the best for programming a web crawler and for processing the gathered information? (It should be used for data mining.) Which is fastest at runtime?…
-3
votes
1 answer

How to gather high-quality entropy on a Linux machine in a safe, cheap and easy way?

When no radioactive decay is available and good entropy is strongly advised for security reasons you experience a real problem. HTTPS connections consume a lot of entropy. If you have thousands of them per hour between machines low on good entropy…
-3
votes
2 answers

Restricting access from bots

I would like to protect my server from too many hits from bots. Consider a scenario where a (physical) server located in a private network is hitting my server continuously. Do I have a mechanism to identify the server behind the hits, say…
-4
votes
2 answers

What IP will be logged by a website if I access it via another website from my PC?

If http://example2.com sends a cURL request to a website called http://example1.com, and I access http://example2.com from my PC to see the content of http://example1.com, then would http://example1.com log my PC's IP address or…
-4
votes
1 answer

Is there any web crawler that can access a site's members area for download?

I am currently using HTTrack as a web crawler; can it use my credentials to access the members area and download the zip files, since they are restricted from public access? Thank you in advance. Update: After all, the problem was with my IPv6; it must…
M. A.
  • 97
  • 6
-4
votes
1 answer

Bot being redirected to Google.com when it requests Myspace.com… what?

First time on Server Fault. I'm having a problem connecting to Myspace.com through my server. I've been using mechanize via Python to run a bot (not spam; crawling for information on musicians) on a variety of websites. It's been working for weeks…
Artur Sapek
  • 103
  • 3
-8
votes
2 answers

Get all URLs of a website

I want to build a tool which scans a website for all URLs, not the URLs within each page but those of the site itself, but I don't know how. Could anyone give me an example of how I can start? Example: www.localhost.dev /upload /login …
chunk0r
  • 11
  • 4