
What I am trying to do is check an IP address, for example 66.249.74.233. A reverse hostname lookup on that IP address returns: crawl-66-249-74-233.googlebot.com.

What I would like to do from there is check crawl-66-249-74-233.googlebot.com to make sure it actually belongs to Google. I believe it can be done, but I am not sure what I should be checking the hostname against to confirm that it belongs to Google.

Any thoughts?

MrTechie
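
What the question describes is a forward-confirmed reverse DNS check, which is also the verification method Google's own documentation recommends for Googlebot: reverse-resolve the IP, check that the resulting hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the original IP. A minimal PHP sketch, using PHP's built-in resolver functions; the function name is an illustrative choice, not an official API:

```php
<?php
// Forward-confirmed reverse DNS, a minimal sketch.
function isVerifiedGooglebot($ip)
{
    // Step 1: reverse lookup (IP -> hostname). gethostbyaddr()
    // returns the unmodified IP when there is no PTR record.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }

    // Step 2: the hostname must end in a Google-controlled domain.
    // This alone proves nothing -- anyone can point their own PTR
    // record at googlebot.com -- which is why step 3 is required.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 3: forward lookup (hostname -> IP). Only Google controls
    // the A records under googlebot.com, so the round trip succeeds
    // only for genuine Googlebot addresses.
    return gethostbyname($host) === $ip;
}

var_dump(isVerifiedGooglebot('66.249.74.233')); // bool(true) at the time of writing
```

Since both lookups hit the network, caching the verdict per IP is worthwhile so repeat visits from the same address skip the round trip.
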
  • Unfortunately you can't do this with accuracy. While *Google* might happen to *own* the IP address range that they are using, this will not be true in the general case. For example, look up `stackoverflow.com`'s IP, then look to see who owns the block. You'll find it belongs to the colo center(s) that SO uses. Can you tell us more about what you're trying to *accomplish* here? What's your underlying goal? See also: [the XY problem](http://meta.stackexchange.com/q/66377/135887) – Charles Dec 22 '12 at 22:04
  • @Charles - basically I have a client's site that attracts hackers trying to get in. But I have set up some code that checks an IP against its country; basically, if it's overseas, it's added to a ban list. Recently I was checking some stuff in Webmaster Tools - fetched as Googlebot - and found it was banned, because the IP resided in Canada, outside the US. So I am trying to find a solution to that problem and was thinking of just checking the hostname and confirming it. – MrTechie Dec 22 '12 at 22:11
  • Okay, that makes more sense. Were you simply trying to ensure that Googlebot (and other known bots, I suppose) aren't accidentally banned? – Charles Dec 22 '12 at 22:22
  • If you just want to check that the reverse domain entry is legitimate, perform a forward lookup of the name and see if you get back the original IP. – Barmar Dec 22 '12 at 22:28
  • @Charles - yes. One way I thought about doing it was exploding the hostname on the . and then checking an array of known domains against the exploded parts to see if any of them matched - then I would know it was possibly the true bot. But I thought maybe there's a better solution. – MrTechie Dec 22 '12 at 22:28
  • @Barmar - yes, I could do that too. Take this: $dnsr = dns_get_record($hostname, DNS_A); I then get back [host] => crawl-66-249-74-233.googlebot.com [ip] => 66.249.74.233 and can just confirm it - sketched after this thread. – MrTechie Dec 22 '12 at 22:31
  • This sounds like it will create a bottleneck for page loads, as it must wait for a lookup before deciding whether to let a visitor in. "basically I have a client's site that attracts hackers trying to get in" - why don't you just fix the security? If it is just bots scanning for security issues, then once you fix your security you don't have to worry about them finding anything. Blocking by IP origin is not ideal; all they have to do is use a proxy to bypass it. Fix the root of the problem. – kittycat Dec 22 '12 at 22:38
  • Fixing the root of the problem requires a complete site redo. This code was written 10 to 12 years ago. The site had major issues, and I have managed to stabilize it for the most part. I know about hackers using proxies, but the ban acts as somewhat of a deterrent until a new site can be put together. I have fixed some of the issues regarding upload exploits, SQL exploits, etc., but the site is so poorly written, I can't get them all. – MrTechie Dec 22 '12 at 22:42
  • @cryptic I didn't realize he would be using this code in real time when someone tries to connect. I thought he was using it when creating his blacklist, to ensure he doesn't put legitimate crawler IPs on the list. I wonder if there's a DNSBL of crawlers that he could use. – Barmar Dec 22 '12 at 22:45
  • @Barmar - the intention is for a user to come to the site, and the scripting checks whether something funky happened with the URL - for example ?id=398928\' - and if so, it triggers the IP checking feature to find out where they are. If they are outside the US (because the site is for a local paper) they basically get banned, and their IP is added to the list; a sketch of that trigger follows these comments. – MrTechie Dec 22 '12 at 22:49
  • and really - someone -1 because I asked a question?? pathetic... – MrTechie Dec 22 '12 at 22:50
  • Realtime reverse lookups are generally a bad idea, because many sites don't have their reverse DNS configured properly. A common result is a timeout, which can take on the order of 30 seconds. – Barmar Dec 22 '12 at 22:54
  • I'm tied into an API function seen here: http://ipinfodb.com/ip_location_api.php which gives me a result in about a second or so. – MrTechie Dec 22 '12 at 22:55
  • @MrTechie, btw I didn't downvote; I'll upvote to even it out, though, since it shouldn't have been downvoted. Anyway, have you looked at http://phpids.org/ ? For an application like yours it might be useful in detecting an attack and then denying the request based on the attack severity, and optionally you can then add the IP to a blacklist. A legitimate search engine should not trigger any of these rules. This approach won't require a lengthy lookup and may prove more robust in your situation. – kittycat Dec 22 '12 at 23:14
  • @cryptic - thanks - I will take a look at it. It looks like it may be something that could prove useful for this site. – MrTechie Dec 22 '12 at 23:22
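
For the dns_get_record() variant mentioned in the comments, the forward confirmation can check the original IP against every returned A record, since a hostname may legitimately resolve to more than one address. A small sketch:

```php
<?php
// Forward confirmation via dns_get_record(), as discussed above.
// A host may have several A records, so compare the original IP
// against each of them.
$ip   = '66.249.74.233';
$host = gethostbyaddr($ip);

$confirmed = false;
// dns_get_record() returns false on failure; the (array) cast
// turns that into an empty loop instead of a warning.
foreach ((array) dns_get_record($host, DNS_A) as $record) {
    if (isset($record['ip']) && $record['ip'] === $ip) {
        $confirmed = true;
        break;
    }
}

var_dump($confirmed);
```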

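Finally, a hedged sketch of the ban trigger described in the comments; every name below is hypothetical, and isVerifiedGooglebot() refers to the FCrDNS sketch under the question:

```php
<?php
// Hypothetical trigger logic: on a suspicious query string (e.g. a
// stray quote, as in ?id=398928\'), verify the crawler before banning.
// isVerifiedGooglebot() is the helper sketched under the question.

// Stub: on the real site this would write to the persistent ban list.
function addToBanList($ip)
{
    file_put_contents('banlist.txt', $ip . PHP_EOL, FILE_APPEND);
}

$ip    = $_SERVER['REMOTE_ADDR'];
$query = isset($_SERVER['QUERY_STRING']) ? $_SERVER['QUERY_STRING'] : '';

if (preg_match('/[\'"\\\\]/', $query)) {
    // Whitelist verified crawlers first, so Googlebot is never banned.
    if (!isVerifiedGooglebot($ip)) {
        addToBanList($ip);
    }
}
```
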
0 Answers