
I have built a pretty basic advertisement manager for a website in PHP.

I say basic because it's not complex like Google or Facebook ads, or even most high-end ad servers; it doesn't handle payments or user targeting.

It serves its purpose for my low-traffic site, though: show a random banner ad, count impressions, and count clicks.

Features:

  • Ad slot/position on page
  • Banner image
  • Name
  • View/impression counter
  • Click counter
  • Start and end date, or never ending
  • Disable/enable ad

I want to gradually add more functionality to the system, though.

One thing I have noticed is that the impressions/views counter often seems inflated.

I believe the cause of this is social networks' spiders and bots, as well as search engine spiders.

For example, if someone enters a URL from a page on my website into Facebook, Google+, Twitter, LinkedIn, Pinterest, or another network, those sites will often spider my site to gather the page's title, images, and description.

I would really like to prevent these from counting as advertisement impressions/views when an actual human is not viewing the page.

I realize it will be very hard to detect all of these, but if there is a way to catch the majority of them, it will at least make my stats a little more accurate.

So I am reaching out for any help or ideas on how to achieve my goal. Please do not say to use another advertisement system; that is not in the cards. Thank you.


– JasonDavis
    You should consider filtering on user-agent. A clever bot will always be able to impersonate a browser, though. – nanofarad Jul 07 '13 at 19:12
  • I would suggest issuing an AJAX POST after page load with the IDs of the banners on the page (see the sketch below). Additionally, you can disallow this tracking script in robots.txt. – dev-null-dweller Jul 07 '13 at 19:23
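
A rough sketch of that comment's approach (the endpoint name track.php, the ids parameter, and the table/column names are all assumptions, not the asker's actual code). After page load, the page would POST the visible banner IDs to a small endpoint (via XMLHttpRequest or fetch), and only that endpoint counts the impression:

<?php
// track.php -- hypothetical impression endpoint; only real browsers tend to
// execute the JS that calls it, so most crawler views never reach this code
if ($_SERVER['REQUEST_METHOD'] === 'POST' && !empty($_POST['ids'])) {
    $ids = array_filter(array_map('intval', explode(',', $_POST['ids'])));
    if ($ids) {
        $pdo = new PDO('mysql:host=localhost;dbname=ads', 'user', 'pass'); // assumed credentials
        $in  = implode(',', array_fill(0, count($ids), '?'));
        $pdo->prepare("UPDATE ads SET impressions = impressions + 1 WHERE id IN ($in)")
            ->execute($ids);
    }
}

Per the second half of the comment, you would also add Disallow: /track.php to robots.txt so well-behaved crawlers never request the endpoint at all.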

4 Answers


You need to serve the ads with JavaScript. That's the only way to avoid most of the crawlers: only browsers load dependencies like images, JS, and CSS; 99% of robots skip them.

You can also do this:

// basic crawler detection and block script (no legit browser should match this)
if(!empty($_SERVER['HTTP_USER_AGENT']) and preg_match('~(bot|crawl)~i', $_SERVER['HTTP_USER_AGENT'])){
    // this is a crawler and you should not show ads here
}

You'll have much better stats this way. Use JS for ads.
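
A minimal sketch of serving an ad through JS (the file name ad.js.php, the ad-slot element ID, and the stubbed ad record are assumptions, not the asker's actual code). The page would include it with <script src="ad.js.php"></script>:

<?php
// ad.js.php -- hypothetical: since most crawlers never execute script includes,
// the impression is only counted (and the banner only rendered) for real browsers
header('Content-Type: application/javascript');
$ad = ['id' => 1, 'image' => '/banners/demo.png']; // stub: pick a random active ad and bump its counter here
$html = '<a href="click.php?id=' . $ad['id'] . '"><img src="' . $ad['image'] . '" alt=""></a>';
echo 'document.getElementById("ad-slot").innerHTML = ' . json_encode($html) . ';';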

PS: You could also try setting a cookie in JS and checking for it later. Crawlers might accept cookies sent by PHP over HTTP, but with cookies set in JS there's a 99.9% chance they'll miss them, because that requires loading and interpreting a JS file, and only browsers do that.
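
A minimal sketch of that cookie idea (the cookie name js_ok is an assumption):

<?php
// a page that real browsers render would contain:
//   <script>document.cookie = "js_ok=1; path=/";</script>
// on later requests, the cookie's presence means a browser executed our JS before
$probablyHuman = isset($_COOKIE['js_ok']);
if ($probablyHuman) {
    // safe to count this impression
}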

– CodeAngry
  • This solution seems to be the best at this point, but for a self-developed ad app I wouldn't recommend it; crawlers evolve at a very fast pace. As more and more websites use JS to validate users, crawlers will implement it as well... – jnhghy - Alexandru Jantea Oct 16 '13 at 05:49
  • @alexalex **No they won't, as JS responds to user input.** Mouse, keyboard and such... so a crawler cannot generate all combinations of that input and track what the JS is doing. Google does include JS as they take snapshots of pages for the preview, but they also get stuck at the `onLoad` layer and none of the interactivity. *So... NO.* No in-house crawler will justify loading JS, now or in the foreseeable future. – CodeAngry Oct 16 '13 at 10:34
  • I like the bold characters... http://www.emoticode.net/python/rendered-javascript-crawler-with-scrapy-and-selenium-rc.html What do you mean by foreseeable? We are talking about an advertisement manager, so I can see a lot of reasons to create a crawler that will hit ads... I just wanted to say that any solution for spider/bot detection in this field is a solution only for the given time -> it has to be constantly updated... Saying that use of JS is best is correct (I upvoted the answer), but that this is the solution for the foreseeable future... I'm not so sure... – jnhghy - Alexandru Jantea Oct 16 '13 at 12:33
  • @alexalex If someone creates a specialized crawler to hit ads, there's no stopping it. But he wants to minimize false positives, which means catching the 99% of crawlers (only 1% know JS). It's MUCH better than nothing... – CodeAngry Oct 16 '13 at 18:07
  • This answer is no longer correct. Many bots (including Google) run all JS and render the page fully before indexing, in order to handle the increasing number of sites which load their content dynamically over AJAX (and single page apps) etc. – NickG Mar 14 '18 at 14:45
  • @CodeAngry - so how does this check if JS is "enabled"... meaning, how does it check whether it is a BOT based on JS being enabled/disabled/loaded? I'm dealing with similar issues for in-house stats tracking, and noticed greatly inflated DESKTOP traffic. I implemented bot filtering by user agent, but still without 100% success, as many bots/crawlers/spiders mimic the browser user agent. Please provide a "solution" or explain what your answer means (for lay people like me). – Levchik Aug 03 '20 at 13:59
  • @CodeAngry - follow up on my previous comment - I used this script: https://stackoverflow.com/a/60055171/1325195 ... not sure how good it is. – Levchik Aug 03 '20 at 14:07

You could do something like this. There is a good list of crawlers in text format here: http://www.robotstxt.org/db/all.txt

Assume you've collected all of the user agents from that file into an array called $botList:

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : null;

$isBot = false;
if ($ua) {
    foreach ($botList as $botUa) {
        // substring match: real UA headers carry extra tokens, so an exact
        // in_array() comparison would rarely hit ($botList must be lowercased too)
        if (strpos($ua, $botUa) !== false) { $isBot = true; break; }
    }
}

Of course, the user agent can easily be spoofed or may sometimes be missing, but search engines like Google and Yahoo are honest about themselves.

– keune
  • -1 You are not teaching him to look through 256KB of user agents and compare strings against so many possibilities! You are not telling him to kill his website's performance! Right? – CodeAngry Jul 07 '13 at 19:26
  • Have you checked that list? I said to only take the user agent text and put it into an array, and I meant not at runtime, but one time by hand-coding. This would have almost no effect on performance. Your attitude is very unconstructive, btw. – keune Jul 07 '13 at 19:33
  • It's very unconstructive :) And always pro-performance. Read my response to see my attitude. – CodeAngry Jul 07 '13 at 19:35
  • It wouldn't do any good either. You do save some time by avoiding file I/O, but there's a gazillion entries in there - performance would take a hit anyway. Unless you're running on a cloud, generating a ton of revenue, and can afford multiple processors just to go through the list. Sorry, I'm with @CodeAngry on this one. – rath Jul 07 '13 at 19:55
  • This was pretty much my original idea. Not exactly for all of those, but just for the main social networks and search engines, so it would be maybe a max of 10. I do think the JavaScript method will be more reliable, though. – JasonDavis Jul 07 '13 at 22:14
  • If you hard-code the entries of the list into a hash map or dictionary with constant-time read access, you will probably not notice any effect on the performance of the website (see the sketch below). This is most certainly not a complete approach, but not a bad one either. And performance is no argument here if you implement the suggestion in a sensible manner. – dsteinhoefel May 21 '14 at 06:58
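
A sketch of dsteinhoefel's point, assuming the hand-collected $botList from the answer above: flipping the list into a hash map makes each lookup constant-time, at the cost of only catching exact user-agent matches:

<?php
// one-time: turn the list into a hash map keyed by lowercased user agent
$botMap = array_flip(array_map('strtolower', $botList));

// per request: O(1) lookup instead of scanning the whole list
$ua = strtolower($_SERVER['HTTP_USER_AGENT'] ?? '');
if (isset($botMap[$ua])) {
    // probably a bot
}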

A crawler will usually download robots.txt, even if it doesn't respect it and only fetches it out of curiosity. This is a good indication you might be dealing with one, although it's not definitive.
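
One way to turn that into a usable signal: serve robots.txt from PHP and remember who fetched it (the rewrite of robots.txt to a script and the file names are assumptions, not a tested setup):

<?php
// robots.php -- hypothetical; map requests for robots.txt to this script
// (e.g. with a rewrite rule) and log the fetching IP as a probable crawler
file_put_contents(__DIR__ . '/robots_fetchers.log', $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
header('Content-Type: text/plain');
echo "User-agent: *\nDisallow:\n"; // emit your normal robots.txt content here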

You can also detect a crawler if it visits a huge number of links in a very short time, though this can be quite complicated to do in code.
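
A naive sketch of such a rate check (assumes the APCu extension is available; the window and threshold are arbitrary):

<?php
$key = 'hits_' . $_SERVER['REMOTE_ADDR'];
apcu_add($key, 0, 10);    // create a counter with a 10-second lifetime if it doesn't exist yet
$hits = apcu_inc($key);   // increment and read it
if ($hits > 20) {
    // over 20 requests in ~10 seconds from one IP: likely a crawler, skip counting the impression
}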

But these approaches are only worth it if you can't or don't want to rely on JavaScript. Otherwise, go with CodeAngry's answer.


Edit: in response to @keune's answer, you could keep all the visitor IPs, run them through the list in a cron job, and then publish the updated visitor count.
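
A sketch of that batch approach, checking reverse DNS rather than the raw user-agent list (the log file name and the host patterns are assumptions):

<?php
// cron_filter.php -- hypothetical nightly job: re-check logged impression IPs
// and drop hits whose reverse DNS points at a known crawler host
$ips   = file(__DIR__ . '/impressions.log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$human = 0;
foreach ($ips as $ip) {
    $host = gethostbyaddr($ip); // e.g. crawl-66-249-66-1.googlebot.com
    if ($host && !preg_match('/googlebot\.com$|search\.msn\.com$|crawl|spider/i', $host)) {
        $human++;
    }
}
// persist $human as the corrected impression count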

– rath
  • **I crawl the web like crazy but never bother about `robots.txt` ;)** Plus, tracking `robots.txt` access can be done via raw access logs or by generating the file in PHP. It complicates things a bit and won't stop shady crawlers... **The access speed can be an indicator**, but I almost always load just one page from each domain when crawling. And it can bottleneck severely on DB write access if not done properly. – CodeAngry Jul 07 '13 at 20:02
  • You're a rude spider then. – rath Jul 07 '13 at 20:05
  • I'm an **SEO spider**. We're all the same :) Why would you want to stick out when you need to go unnoticed... – CodeAngry Jul 07 '13 at 20:06

Try this:

if (preg_match("/^(Mozilla|Opera|PSP|Bunjalloo|wii)/i", $_SERVER['HTTP_USER_AGENT'])
    && !preg_match("/bot|crawl|crawler|slurp|spider|link|checker|script|robot|discovery|preview/i", $_SERVER['HTTP_USER_AGENT'])) {
    // it's not a bot
} else {
    // it's a bot
}
– L F (edited by Jose Gómez)