Questions tagged [web-crawler]

A web-crawler (also known as a web spider) traverses the web by following the hyperlinks contained within each page. The crawler is usually given an initial seed of URLs from which to initialize its crawl.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler revisits them.
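The seed/frontier loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production crawler: the `max_pages` cap stands in for the download-prioritization policies mentioned above, and real crawlers would also honor robots.txt, rate-limit per host, and deduplicate more carefully.

```python
# Minimal sketch of the seed/frontier crawl loop, stdlib only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, max_pages=10):
    """Visit URLs breadth-first, adding discovered links to the frontier."""
    frontier = deque(seeds)   # the "crawl frontier": URLs yet to visit
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue              # unreachable or deleted page: skip it
        parser = LinkExtractor(url)
        parser.feed(html)
        frontier.extend(parser.links)   # newly discovered URLs join the frontier
    return visited
```

Relative links are resolved with `urljoin` so that `b.html` found on `http://example.com/a/` becomes `http://example.com/a/b.html` before entering the frontier.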

More on Wikipedia

98 questions
1
vote
1 answer

What things should I consider when identifying and rate limiting bots?

Not sure if this question is best fit for serverfault or webmasters stack exchange... I am thinking to rate limit access to my sites because identifying and blocking bad bots take most of my time. For example I have bots accessing the site by…
adrianTNT
  • 1,077
  • 6
  • 22
  • 43
1
vote
1 answer

Getting requests for suspicious php files

I am getting weird GET requests on my (non php supporting) web server for some curious looking php files. Was just wondering whether these are harmless requests of certain browser tools or attempts from a crawler to find flaws / misconfigurations in…
Luftbaum
  • 111
  • 2
1
vote
1 answer

Website blocks my requests from linux ubuntu server

I'm a Java engineer with zero DevOps experience. Lately I was playing around with an Ubuntu Linux server for the first time and used Docker with my Selenium project, and faced this problem: I try to scrape HTML from a website, but my calls are getting blocked,…
1
vote
0 answers

Spotify Bot Using Massive Bandwidth on NGINX Cached Server?

I have a couple of podcasts I host on my site and I've noticed a disturbing trend the last couple of months: my site's bandwidth usage has gone up by 10x, but it appears most of it was a series of Google App Server instances, not an incredible…
Timothy R. Butler
  • 703
  • 2
  • 11
  • 22
1
vote
1 answer

Why would Apache log different response sizes for the same url?

I noticed a couple of (ostensibly) harmless log entries, and (I'm admittedly overthinking this by a mile) got curious about Apache2 response sizes. This Ukrainian crawler † hit my web daemon, two seconds later requesting a duplicate. Apache2 replied…
zedmelon
  • 113
  • 6
0
votes
1 answer

How to block attempts for phpMyAdmin?

I converted my website from asp.net to .net core and host it on the same server. Now the website gets hundreds of hits daily from different IPs trying to access paths like /php-myadmin/ /wp-content/ /mysql/ None of these directories exist on my website, I…
0
votes
1 answer

My website might have problems being indexed by Google bots?

http://ptcsavjetovaliste.org and I think because it is in Croatian language it might have problems being indexed because of letters like čćžšđ?! Look at the crawler errors I see in Webmaster tools...…
Vedran
0
votes
2 answers

Counting the number of pages in a website

What is the easiest way to get a count of the number of pages on a website? I don't want to actually download a local copy the entire site, just get a count of pages on it. Is there a tool (or combination of tools) that can crawl all the pages and…
DrStalker
  • 6,946
  • 24
  • 79
  • 107
0
votes
0 answers

Strange Google Behavior with indexing SSL Mismatch content

Here is a strange one for you. We have a server with multiple VHOSTS that include both SSL and Non-SSL domains. Domain1 is SSL enabled, while Domain2 doesn't have SSL. Since all these domains are hosted on the same IP, apache would respond to…
mamad
  • 1
  • 1
0
votes
1 answer

Yandex/Google Bot Spam

I recently logged into a VPS I have (with Vultr, if that is of any concern), to find a large number of nginx log entries and a higher than expected load average. This server is doing effectively nothing, and just serves the default nginx page on port 80. An…
dukky
  • 1
0
votes
1 answer

Bots/crawlers adding numbers to GET parameters

I've got some errors showing up in my site logs where some bots are trying to access URLs with strange GET params. # normal url example.com?foo=123456 # odd url triggering integer error by bots example.com?foo=1234562121121121212.1 I've got the…
Pete
  • 293
  • 1
  • 5
  • 20
0
votes
1 answer

Remove subdomain from Google Crawler

I recently removed a sub-domain from my domain so I just have 1 website to manage. However, if I do a google search, my old domain is still there, I removed the sub-domain well over a week ago and if you try to access the domain directly, you will…
Walter White
0
votes
1 answer

Moved website to new server - updated DNS - web crawlers still hitting old site by IP

About ten days ago I moved a site - mostly a Joomla discussion board - to a new server at a different IP address. During a brief scheduled downtime I replicated the content over and completed DNS switchover (via Cloudflare) as usual, and most…
Ryan
  • 81
  • 1
  • 8
0
votes
1 answer

Nginx log shows suspicious directory access!!! How to block them?

In my Nginx log I have recently noticed hundreds of entries like this, where a directory search was executed with an error because those directories do not exist on my webserver. Now, how can I block them once they failed searching few…
Tapash
  • 153
  • 1
  • 6
0
votes
0 answers

How to make Google crawl my site using IPv6 address when my domain name has both IPv4 and IPv6 addresses?

My domain name has both IPv4 and IPv6 addresses assigned. IPv4 connection to Google can't be available all the time due to restrictions of my campus network, but IPv6 is available all the time. Google fails to access my site when IPv4 connection…