Questions tagged [web-crawler]

A web-crawler (also known as a web spider) traverses the web by following the hyperlinks contained within each page. The crawler is usually given an initial seed of URLs from which to initialize its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.
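The seed/frontier loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production crawler: the `max_pages` cap stands in for the download-prioritization policies mentioned above, and real crawlers would also honor robots.txt, rate-limit per host, and deduplicate more carefully.

```python
# Minimal sketch of the seed/frontier crawl loop, stdlib only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, max_pages=10):
    """Visit URLs breadth-first, adding discovered links to the frontier."""
    frontier = deque(seeds)   # the "crawl frontier": URLs yet to visit
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue              # unreachable or deleted page: skip it
        parser = LinkExtractor(url)
        parser.feed(html)
        frontier.extend(parser.links)   # newly discovered URLs join the frontier
    return visited
```

Relative links are resolved with `urljoin` so that `b.html` found on `http://example.com/a/` becomes `http://example.com/a/b.html` before entering the frontier.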

More on Wikipedia

98 questions
1
vote
1 answer

What things should I consider when identifying and rate limiting bots?

Not sure if this question is best fit for serverfault or webmasters stack exchange... I am thinking to rate limit access to my sites because identifying and blocking bad bots take most of my time. For example I have bots accessing the site by…
adrianTNT
  • 1,077
  • 6
  • 22
  • 43
1
vote
1 answer

Getting requests for suspicious php files

I am getting weird GET requests on my (non php supporting) web server for some curious looking php files. Was just wondering whether these are harmless requests of certain browser tools or attempts from a crawler to find flaws / misconfigurations in…
Luftbaum
  • 111
  • 2
1
vote
1 answer

Website blocks my requests from linux ubuntu server

I'm a Java engineer with zero DevOps experience. Lately I was playing around with an Ubuntu Linux server for the first time and used Docker with my Selenium project, and faced this problem: I try to scrape HTML from a website, but my calls are getting blocked,…
1
vote
0 answers

Spotify Bot Using Massive Bandwidth on NGINX Cached Server?

I have a couple of podcasts I host on my site and I've noticed a disturbing trend the last couple of months: my site's bandwidth usage has gone up by 10x, but it appears most of it was a series of Google App Server instances, not an incredible…
Timothy R. Butler
  • 703
  • 2
  • 11
  • 22
1
vote
1 answer

Why would Apache log different response sizes for the same url?

I noticed a couple of (ostensibly) harmless log entries, and (I'm admittedly overthinking this by a mile) got curious about Apache2 response sizes. This Ukrainian crawler † hit my web daemon, two seconds later requesting a duplicate. Apache2 replied…
zedmelon
  • 113
  • 6
0
votes
1 answer

How to block attempts for phpMyAdmin?

I converted my website from asp.net to .net core and host it on the same server. Now the website gets hundreds of hits daily from different IPs trying to access paths like /php-myadmin/ /wp-content/ /mysql/ None of these directories exist on my website, I…
0
votes
1 answer

My website might have problems being indexed by Google bots?

http://ptcsavjetovaliste.org and I think because it is in Croatian language it might have problems being indexed because of letters like čćžšđ?! Look at the crawler errors I see in Webmaster tools...…
Vedran
0
votes
2 answers

Counting the number of pages in a website

What is the easiest way to get a count of the number of pages on a website? I don't want to actually download a local copy the entire site, just get a count of pages on it. Is there a tool (or combination of tools) that can crawl all the pages and…
DrStalker
  • 6,946
  • 24
  • 79
  • 107
0
votes
0 answers

Strange Google Behavior with indexing SSL Mismatch content

Here is a strange one for you. We have a server with multiple VHOSTS that include both SSL and Non-SSL domains. Domain1 is SSL enabled, while Domain2 doesn't have SSL. Since all these domains are hosted on the same IP, apache would respond to…
mamad
  • 1
  • 1
0
votes
1 answer

Yandex/Google Bot Spam

I recently logged into a VPS I have (with Vultr, if that is of any concern), to find a large number of nginx log entries and a higher than expected load average. This server is doing effectively nothing, and just serves the default nginx page on port 80. An…
dukky
  • 1
0
votes
1 answer

Bots/crawlers adding numbers to GET parameters

I've got some errors showing up in my site logs where some bots are trying to access URLs with strange GET params. # normal url example.com?foo=123456 # odd url triggering integer error by bots example.com?foo=1234562121121121212.1 I've got the…
Pete
  • 293
  • 1
  • 5
  • 20
0
votes
1 answer

Remove subdomain from Google Crawler

I recently removed a sub-domain from my domain so I just have 1 website to manage. However, if I do a google search, my old domain is still there, I removed the sub-domain well over a week ago and if you try to access the domain directly, you will…
Walter White
0
votes
1 answer

Moved website to new server - updated DNS - web crawlers still hitting old site by IP

About ten days ago I moved a site - mostly a Joomla discussion board - to a new server at a different IP address. During a brief scheduled downtime I replicated the content over and completed DNS switchover (via Cloudflare) as usual, and most…
Ryan
  • 81
  • 1
  • 8
0
votes
1 answer

Nginx log shows suspicious directory access!!! How to block them?

In my Nginx log I have recently noticed hundreds of entries like this, where a directory search was executed with an error because those directories do not exist on my webserver. Now, how can I block them once they failed searching few…
Tapash
  • 153
  • 1
  • 6
0
votes
0 answers

How to make Google crawl my site using IPv6 address when my domain name has both IPv4 and IPv6 addresses?

My domain name has both IPv4 and IPv6 addresses assigned. IPv4 connection to Google can't be available all the time due to restrictions of my campus network, but IPv6 is available all the time. Google fails to access my site when IPv4 connection…