
I have built a photo community web application in PHP/MySQL, using CodeIgniter as a framework. All content is public so search engines regularly drop by. This is exactly what I want, yet it has two unwanted side effects:

  • Each visit creates a session in my session table.
  • Each visit by a search engine to a photo page increases the view counter.

As for the second problem, I am changing my view count script so that it is only called from JavaScript; that should prevent search engines from increasing the count, right?
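Roughly what I have in mind, as a sketch; the class, table, and column names are placeholders:

```php
<?php
// Sketch of a CodeIgniter controller whose URL is requested from the photo
// page via JavaScript (an async request after page load) rather than during
// the page render, so crawlers that don't execute JavaScript never hit it.
// Class, table, and column names are placeholders.
class Viewcount extends CI_Controller {

    public function increment($photo_id)
    {
        $this->load->database();

        // 'views + 1' is a raw expression, so pass FALSE to skip escaping.
        $this->db->set('views', 'views + 1', FALSE);
        $this->db->where('id', (int) $photo_id);
        $this->db->update('photos');
    }
}
```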

As for the session table, my thinking was to clean it up after the fact using a cron job, so there is no impact on performance. I'm recording the IP address and user agent string in the session table, so a blacklist approach seems best. If so, what is the best way to approach it? Is there an easy/reusable way to determine that a session belongs to a search engine?
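The cron job would then delete sessions whose recorded user agent matches a known crawler, something like this sketch (the connection details, table layout, and pattern list are only examples):

```php
<?php
// Cron-run cleanup sketch: delete sessions whose user agent matches a
// known crawler substring. Connection details, table/column names, and
// the pattern list are examples only.
$db = new mysqli('localhost', 'user', 'password', 'photo_app');

$bot_patterns = array('Googlebot', 'bingbot', 'Slurp', 'Baiduspider');

$stmt = $db->prepare(
    "DELETE FROM ci_sessions WHERE user_agent LIKE CONCAT('%', ?, '%')"
);
foreach ($bot_patterns as $pattern) {
    $stmt->bind_param('s', $pattern);
    $stmt->execute();
}
$stmt->close();
```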

Fer
  • I worry about them because the number of sessions they create is large and this will eventually explode my db – Fer Mar 18 '11 at 14:43
  • Along the lines of @meager: unless the search engines are hitting your site 1000s of times a day, there should be no performance degradation on your site. – Patrick Mar 18 '11 at 14:44

3 Answers

  • Identify major search engines (Hint)
  • Check visitors against your precompiled list (above)
  • Do not start a session or increase the counter on a match (see the sketch below)

Edit:

List of User-Agents
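A rough sketch of the check; the patterns here are only examples, so build the real list from the link above:

```php
<?php
// Returns true if the user agent contains a known crawler signature.
// The patterns below are examples; populate the list from the link above.
function is_search_engine($user_agent)
{
    $bots = array('Googlebot', 'bingbot', 'Slurp', 'Baiduspider', 'YandexBot');
    foreach ($bots as $bot) {
        if (stripos($user_agent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: skip the session and the counter for matched crawlers.
if (!is_search_engine($_SERVER['HTTP_USER_AGENT'])) {
    // start the session, increment the view counter, etc.
}
```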

fabrik
  • Thanks, that's quite an extensive list. It's probably best I run this check afterwards, not upon each page visit. – Fer Mar 18 '11 at 14:46
  • If you check your db every now and then you can easily eliminate these records on a weekly/monthly basis. – fabrik Mar 18 '11 at 14:48

Why are you worried about either of these situations? The best strategy for dealing with crawlers is to treat them like any other user.

Sessions created by search engines are no different from any other session. They all have to be garbage collected, as you can't possibly assume that every user is going to click the "logout" button when they leave your site. Handle them the same way you handle any expired session. You have to do this anyway, so why invest extra time in treating search engines differently?
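CodeIgniter's database session driver periodically garbage-collects expired rows itself; if you'd rather drive it from cron, the whole job boils down to one DELETE. A sketch (table and column names match CodeIgniter's default ci_sessions table; connection details and lifetime are placeholders):

```php
<?php
// Expired-session garbage collection sketch. Table and column names match
// CodeIgniter's default ci_sessions table; connection details and the
// lifetime are placeholders.
$lifetime = 7200; // seconds, e.g. the sess_expiration config value

$db = new mysqli('localhost', 'user', 'password', 'photo_app');

$stmt = $db->prepare('DELETE FROM ci_sessions WHERE last_activity < ?');
$cutoff = time() - $lifetime;
$stmt->bind_param('i', $cutoff);
$stmt->execute();
```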

As for search engines incrementing view counters, why is that a problem? "View count" is a misleading term anyway; what you're really telling people is how many times the page has been requested. It's not up to you to ensure a pair of eyeballs actually sees the page, and there is really no reasonable way of doing so. For every bot you "blacklist", there will be a dozen more one-offs scraping your content without serving up friendly user-agent strings.

user229044
  • @meager I can see why the OP is concerned about the view count being off (he just wants to accurately represent the # of times somebody has viewed a picture), but I agree with you on the sessions. – Patrick Mar 18 '11 at 14:46
  • View count in my scenario is not how many times the page is requested, as you say. It takes the user, IP/user agent, and timestamp into account. – Fer Mar 18 '11 at 15:27

Use a robots.txt file to control exactly what search engine crawlers are allowed to see and do.
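For example, a robots.txt at the site root can keep well-behaved crawlers out of whole sections (the paths are illustrative):

```
# Applies to all well-behaved crawlers; paths are examples
User-agent: *
Disallow: /search/
Disallow: /ajax/
```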

Mark Baker
  • The point is not that the search engine should not visit the page. The point is that it should not create a session. As far as I know, robots.txt cannot do that. – Fer Mar 18 '11 at 14:44
  • robots.txt simply lists which URLs/directories should or shouldn't be visited by the crawler. Sessions have nothing to do with that at all. – Marc B Mar 18 '11 at 14:56
  • @marcb. I am aware of what robots.txt is. As I said, opening a URL leads to a session. I want search engines to open the URL yet not create a session. Robots.txt cannot do that. – Fer Mar 18 '11 at 15:21