Questions tagged [web-crawler]

A web crawler (also known as a web spider) traverses the pages of the web by following the URL links contained within each page. The crawler is usually given an initial seed of URLs from which to begin its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume implies that the crawler can only download a limited number of Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.
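The seed/frontier loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: a stubbed link graph and an injected `fetch_links` callable stand in for real HTTP requests and HTML parsing, and the `max_pages` cap is a stand-in for the prioritization policies a real crawler would apply.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: pop URLs from the frontier, enqueue new links.

    seeds       -- the initial list of URLs (the "seeds")
    fetch_links -- callable returning the hyperlinks found on a page
    max_pages   -- cap on total pages visited (a crude download budget)
    """
    frontier = deque(seeds)   # the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue          # already downloaded this page
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited

# Stubbed link graph standing in for real HTTP fetches:
site = {
    "http://example.com/":  ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}
print(sorted(crawl(["http://example.com/"], lambda u: site.get(u, []))))
# → ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

A real implementation would add per-host politeness delays, robots.txt checks, and a revisit policy on top of this loop.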

More on Wikipedia

98 questions
0
votes
3 answers

Google Apps site not being indexed

Help! Google crawlers appear to be visiting my site, but it is not getting indexed. What am I doing wrong? Yahoo has managed to find mydomain.appspot.com and has indexed it successfully (albeit under the appspot address and not my domain), so I assume…
user47122
0
votes
2 answers

How should I interpret site analytics with 11 pageviews in a 3-second visit?

I'm using Google Analytics and recently I've noticed some weird trends. I have a lot of visits that last mere seconds but record several pageviews... more than a normal human can view in that amount of time. A specific case is that the only…
Juank
0
votes
1 answer

Getting web.archive.org to archive website again

I noticed that my website isn't archived anymore by web.archive.org. When I look at http://web.archive.org/web/*/http://www.cnn.com it is clearly visible that archiving stopped in July 2008. web.archive.org has a 6-month-delay policy. This means…
None
0
votes
1 answer

Weird traffic behavior on Ubuntu server

top - 19:51:36 up 1 day, 12:27, 1 user, load average: 19.14, 11.33, 4.74
Tasks: 172 total, 18 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 90.0 us, 10.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3924.0 total, …
0
votes
1 answer

DNS redirecting to maintenance page during planned maintenance - what happens to Google indexing?

We are planning maintenance that could take down our services for a whole day. I would therefore like to show a maintenance page explaining the issue and providing additional info/links. During this time, the machines will be completely down, so…
0
votes
0 answers

Can many connections cause DNS lookup or request timeouts?

I'm running crawlers on my company's network: 10 Raspberry Pis with 45 crawlers each, and 2 desktops with 70 crawlers each. These processes send requests 24/7. 3–5% of packets are getting lost, and this is affecting the queuing system heavily. I'm using my…
0
votes
1 answer

How to avoid sending emails to Google's deep web crawler

My website has an area restricted to users who sign up with a valid email. I have received requests with bogus emails, and I want to avoid sending emails to non-existent addresses lest they increase the bounce rate and hurt my sending reputation. The…
miguelmorin
  • 249
  • 1
  • 5
  • 13
0
votes
1 answer

Can missing HTTP referrers identify web crawlers?

I am currently trying to analyze the traffic of a website. Besides specifics regarding the requested resource and timestamps, the tracking system only provides the request's HTTP referrer. In most instances the referrer is null. Given that the…
user600511
0
votes
2 answers

Strange behavior in Apache log

I have a Nextcloud server running on Apache, and I disabled my firewall for about 5 minutes while I ran an apt update. I decided to check the logs afterwards, and found this from an unknown IP. It looks like it is trying to run some sort of script. Does…
-1
votes
2 answers

How to disallow crawling for all subdomains using my main domain's physical robots.txt file

I have multiple physical sub-domains and I don't want to change the robots.txt file of any of those sub-domains. Is there any way to disallow all the sub-domains from my main domain's physical…
Aditya Shah
  • 101
  • 3
-1
votes
1 answer

40.96.18.165 keeps visiting my web server

Something/someone from 40.96.18.165 has been hitting my web server exactly eight times a day, every day since Feb 5, 2017. The user agent used is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0). Lookups show the IP address is from…
Old Geezer
  • 397
  • 8
  • 25
-1
votes
0 answers

Is offering the contents of a third party web site offline violating the law?

I have developed a nice little app that crawls a bunch of newspaper websites and makes their latest content available offline on my phone. It's basically a Pocket-style app that saves content automatically, once a day. I am wondering: if I ever wanted…
-1
votes
3 answers

Open Source Crawler

I came across an open source crawler that recently hit my site, and I was wondering: 1. How do you get a list of sites to crawl? 2. Can you get a list of sites to crawl in your city? 3. If you have all this information, where is it readily…
Walter White
-1
votes
1 answer

Using wget to get count of pages below a link?

I've been using a sitemapping tool to get a simple count of links below a specific URL. The free trial has ended, so I figure that rather than paying $70 for what is very simple functionality, I should just use wget. Here's what I have so far: wget…
rybosome
  • 111
  • 4
-1
votes
1 answer

Google web crawler cannot find my WordPress posts

I have a WordPress blog on my own server, which used permanent links containing Chinese characters in URLs like http://techblog.zellux.czm.cn/2008/03/ics-lab4-%E7%BB%8F%E9%AA%8C/. Several months ago I changed all the URLs to English descriptions…
Epeius
  • 1,031
  • 2
  • 9
  • 6