Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the pages of the web by following the URLs contained within each page. The crawler is usually given an initial seed of URLs from which to initialize its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The Web's high rate of change implies that pages may have already been updated or even deleted by the time the crawler revisits them.
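The seed/frontier loop described above can be sketched in a few lines. This is a minimal illustration only: `fetch_links` is a hypothetical helper standing in for the HTTP fetch and HTML parsing, and a real crawler must also respect robots.txt, rate limits, and politeness policies.

```python
# Minimal sketch of the seed/frontier crawl loop (illustrative only).
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl. `fetch_links(url)` is assumed to return the
    hyperlinks found on the page at `url`."""
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)
    return visited
```

The `max_pages` cap is one crude stand-in for the prioritization policies mentioned above; real crawlers rank the frontier rather than simply stopping.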

More on Wikipedia

98 questions
0
votes
1 answer

Is it possible to block HTTP traffic from specific machines?

I have some web crawlers, and a specific website seems to be blocking traffic temporarily after some time. The thing is, even though all clients have the same external IP address (they access the internet via the same gateway) it blocks specific…
Doug
  • 239
  • 2
  • 6
0
votes
2 answers

Switching between multiple authentication types on same URL

I have a secure SSO site that uses Shibboleth authentication and a SAML identity provider. I need to allow a Google Search Appliance crawler to come index the URLs. I have a requirement to change, on HTTP request, from SAML to Basic authentication…
0
votes
1 answer

Tool or website to check links

We supply Magento and Typo3 installations to customers. To improve QA we wanted to use an automatic link checker to check for broken and/or outdated links. We want to check all links staying inside its own domain, and maybe links…
Dabu
  • 359
  • 1
  • 5
  • 23
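For questions like this one, the core of a link checker is small enough to sketch with the Python standard library. This is a hypothetical minimal sketch; dedicated tools additionally handle redirects, HTML parsing, and recursive crawling.

```python
# Minimal broken-link checker sketch: request each URL and record its HTTP
# status, staying inside one domain. Standard library only; illustrative.
from urllib.parse import urlparse
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_links(urls, domain):
    """Return {url: status} for URLs whose host matches `domain`."""
    results = {}
    for url in urls:
        if urlparse(url).netloc != domain:
            continue  # skip external links
        try:
            req = Request(url, method="HEAD")   # HEAD avoids downloading bodies
            with urlopen(req, timeout=10) as resp:
                results[url] = resp.status
        except HTTPError as e:
            results[url] = e.code    # e.g. 404 for a broken link
        except URLError:
            results[url] = None      # DNS failure, timeout, etc.
    return results
```

Filtering on `urlparse(url).netloc` is what keeps the check "inside its own domain", as the question asks.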
0
votes
1 answer

Referrer in access.log is a directory

It seems that the referrer in the following log entry is a folder. 112.200.208.5 - - [29/Jul/2013:20:43:14 +0800] "GET /sites/default/files/download/argie/pos-code.zip HTTP/1.1" 206 294677 "http://www.mysite.com/sites/default/files/download/argie/"…
jaYPabs
  • 299
  • 1
  • 4
  • 20
0
votes
1 answer

Sharepoint Crawler is denied access to sites

We create all our site collections programmatically with a custom site def/template. Everything works as expected, except for the crawler. It's apparently denied access to the sites. The crawl logs…
noocyte
  • 194
  • 10
0
votes
1 answer

Is there a chance to block images spiders / bots on dedicated servers without using robots.txt or .htaccess?

We know that we can block certain spiders from crawling website pages using robots.txt or .htaccess, or maybe via the Apache configuration file httpd.conf. But that would require editing maybe a large number of sites on some dedicated servers, and bots…
hsobhy
  • 181
  • 1
  • 2
  • 10
0
votes
1 answer

How do I scan my folders for a website? Like a crawler?

I'd like to scan all the URLs on my website as well as get the files in them, but the thing is, there are too many for me to do this manually, so how would I do this? I'd like it formatted any way, as long as there is some type of order to it. Eg:…
user151015
0
votes
1 answer

Webmaster randomly reporting a massive increase in 404s (apparently from old sitemaps)

Well, I'm stumped. Several months back, we launched a totally new website, replacing a legacy system that was pretty messy. Part of the mess was many, many pages created that really didn't need to be there or be crawled by Google. There was a lot of…
0
votes
1 answer

What is "/admin/Y-ivrrecording.php?php=info&ip=uname"?

Help me please. I found an IP address from Korea trying to do something with my web server by putting this '/admin/Y-ivrrecording.php?php=info&ip=uname', as if searching for a filename on my web server. I don't know the reason and have no knowledge about this. I try…
0
votes
2 answers

Methods to prevent malicious crawlers/scrapers and DDoS Attacks

For the last couple of weeks I have been experiencing bot attacks on my site. Basically, crawlers are running on the site at a high frequency, resulting in increased load. This results in bandwidth consumption and thus a poor user experience for the rest…
bilkulbekar
  • 101
  • 2
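A common application-layer mitigation for aggressive crawlers like those described in this question is per-IP rate limiting. Below is a minimal sliding-window sketch, purely illustrative; in production this is usually done in the web server or a reverse proxy rather than in application code.

```python
# Sketch of a per-IP sliding-window rate limiter (illustrative only).
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if the request from `ip` is within the rate limit."""
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()                  # drop requests outside the window
        if len(q) >= self.max_requests:
            return False                 # over the limit: reject (e.g. HTTP 429)
        q.append(now)
        return True
```

Rejected requests would typically get an HTTP 429 response, which well-behaved crawlers interpret as a signal to back off.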
0
votes
1 answer

Any good web crawler besides DRKSpider

I was having a look at DRKSpider to find problems with a website in our production server, but it seems its export feature generates different outputs (with different content). My goal is to find a good tool that shows every type of status code that…
Junior Mayhé
  • 185
  • 1
  • 10
0
votes
2 answers

Can access web application from browser but crawler application throws 404 error?

I am using an application called Xenu Link Sleuth to try and find broken links on a site we host. When I go to the site through a browser it pops right open. When I try to run it through Xenu it immediately throws a 404 not found error. I…
Abe Miessler
  • 925
  • 4
  • 11
  • 20
0
votes
1 answer

How to establish a connection between Drupal and Solr

I am working with technologies like Drupal and Solr. I have installed all the required modules, but I need to know how to crawl data from Drupal and how to form a connector between Drupal and Solr.
netra
  • 1
  • 1
0
votes
1 answer

Web log file analysis software to measure search crawlers

I need to analyze the search engine crawling going on in my site. Is there a good tool for this? I've tried AWStats and Sawmill. But both of those give me very limited insight into the crawling. I need to know information like how many…
apptree
  • 345
  • 1
  • 3
  • 10
0
votes
1 answer

Can I use a Google Appliance/Mini to crawl and index sites I don't own?

Maybe this is a stupid question, but... I am working with this company and they said they needed to get "permission" to crawl other people's sites. They have a Google Search Appliance and some Google Minis and want to point them at other sites to…
John B
  • 171
  • 1
  • 11