
From the HTTP server's perspective.

– orph

5 Answers


You can read the official Verifying Googlebot page.

Quoting the page here:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

    > host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

    > host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
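
A minimal sketch of that double lookup in C# (using System.Net.Dns; the helper name IsVerifiedGooglebot is mine, and error handling is omitted: Dns.GetHostEntry throws a SocketException when the IP has no PTR record):

    // Requires System, System.Linq, and System.Net.
    static bool IsVerifiedGooglebot(IPAddress ip)
    {
        // Reverse DNS lookup: resolve the IP to a host name.
        string hostName = Dns.GetHostEntry(ip).HostName;

        // The name must be in the googlebot.com domain.
        if (!hostName.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase))
            return false;

        // Forward DNS lookup: the name must resolve back to the same IP.
        return Dns.GetHostAddresses(hostName).Any(a => a.Equals(ip));
    }

Since this costs two DNS lookups, you'd cache the verdict per IP rather than repeating it on every request (see the comments below).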

– imgx64
  • Is there no way to query google.com or googlebot.com every so often over DNS to get the list of IPs or IP ranges? Doing this for every incoming request seems painful. Something like an MX record, but for A or AAAA records. – jjxtra Oct 08 '21 at 19:42
  • @jjxtra I would implement this with caching. If you only look up the IP addresses that you haven't looked up recently, it works very well. – Stephen Ostermiller Nov 05 '21 at 09:55

I have captured a Google crawler request in my ASP.NET application, and here's what its signature looks like:

    Requesting IP: 66.249.71.113
    Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

My logs show many different IPs for the Google crawler in the 66.249.71.* range, all geo-located in Mountain View, CA, USA.

A simple way to check whether a request comes from the Google crawler is to verify that it contains Googlebot and http://www.google.com/bot.html. As noted, many different IPs show up with the same requesting client, so I wouldn't recommend checking IPs; that's where the client identity comes into the picture, so verify the client identity (user-agent) instead.

Here's sample code in C#:

    // Request.UserAgent can be null, so guard before matching.
    string userAgent = (Request.UserAgent ?? string.Empty).ToLowerInvariant();

    if (userAgent.Contains("googlebot") || userAgent.Contains("google.com/bot.html"))
    {
        // Yes, it claims to be Googlebot.
    }
    else
    {
        // No, it's something else.
    }

It's important to note that any HTTP client can easily fake this.

– this. __curious_geek

You can now perform an IP address check against Googlebot's published IP address list at https://developers.google.com/search/apis/ipranges/googlebot.json

From the docs:

you can identify Googlebot by IP address by matching the crawler's IP address to the list of Googlebot IP addresses. For all other Google crawlers, match the crawler's IP address against the complete list of Google IP addresses.
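
A minimal sketch of that check in C#, assuming .NET 8's System.Net.IPNetwork for the CIDR matching and the file's { "prefixes": [ { "ipv4Prefix": ... } ] } layout; the helper name and the lack of caching are mine:

    // Requires System, System.Net, System.Net.Http, System.Text.Json,
    // System.Threading.Tasks, and .NET 8 for System.Net.IPNetwork.
    static async Task<bool> IsGooglebotIpAsync(IPAddress ip)
    {
        const string rangesUrl =
            "https://developers.google.com/search/apis/ipranges/googlebot.json";

        using var http = new HttpClient();
        using var doc = JsonDocument.Parse(await http.GetStringAsync(rangesUrl));

        // Each entry holds either an "ipv4Prefix" or an "ipv6Prefix" in CIDR form.
        foreach (var entry in doc.RootElement.GetProperty("prefixes").EnumerateArray())
        {
            foreach (var key in new[] { "ipv4Prefix", "ipv6Prefix" })
            {
                if (entry.TryGetProperty(key, out var prefix) &&
                    IPNetwork.Parse(prefix.GetString()!).Contains(ip))
                    return true;
            }
        }
        return false;
    }

In practice you'd download the list on a schedule and cache the parsed networks instead of fetching the JSON on every request.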

– galdin

If you're using the Apache web server, you could have a look at the access log (e.g. logs/access.log).

Then load Google's IPs from http://www.iplists.com/nw/google.txt and check whether any of them appear in your log.
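
A rough sketch of that cross-check in C#; it assumes Apache's common log format (client IP is the first field) and that the downloaded list holds one address per line with # comment lines, both of which you should verify against the actual files:

    // Requires System, System.IO, System.Linq, and System.Net.Http.
    using var http = new HttpClient();

    // Hypothetical path: adjust to your Apache installation.
    var logLines = File.ReadAllLines("logs/access.log");

    var googleIps = http.GetStringAsync("http://www.iplists.com/nw/google.txt")
        .Result
        .Split('\n')
        .Select(l => l.Trim())
        .Where(l => l.Length > 0 && !l.StartsWith("#"))
        .ToHashSet();

    // The client IP is the first space-separated field of each log line.
    foreach (var line in logLines.Where(l => googleIps.Contains(l.Split(' ')[0])))
        Console.WriteLine(line);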

– weberph

Based on this. __curious_geek's solution, here's the JavaScript version (note that window.navigator.userAgent is only available client-side, in the browser):

    if (window.navigator.userAgent.match(/googlebot|google\.com\/bot\.html/i)) {
      // Yes, it's Google bot.
    }
– Sam