I'm building a service where people get notified (by email) when someone follows a link with the format www.domain.com/this_is_a_hash. The people who use this service can share this link in different places like Twitter, Tumblr, Facebook and more...

The main problem I'm having is that as soon as the link is shared on any of these platforms, a lot of requests to www.domain.com/this_is_a_hash start coming to my server. The problem is that each time one of these requests hits my server, a notification is sent to the owner of the this_is_a_hash, and of course this is not what I want. I just want to send notifications when real people visit this resource.

I found a very interesting article here that talks about the huge number of requests a server receives when a link is posted to Twitter...

So what I need is to prevent search engines from hitting the "resource" URL... the www.mydomain.com/this_is_a_hash

Any ideas? I'm using Rails 3.

Thanks!

Andres

1 Answer

If you don’t want these pages to be indexed by search engines, you could use a robots.txt to block these URLs.

User-agent: *
Disallow: /

(That would block all URLs for all user-agents. If you put these pages under a common path, you could block only that path instead. Or you could add the forbidden URLs dynamically as they get created; however, some bots might cache the robots.txt for some time, so they might not recognize that a new URL should be blocked, too.)
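
If you generate the rules dynamically, a minimal Rails 3 sketch could look like this (the route, the controller and the NotificationLink model are assumptions for illustration; you would also have to remove the static public/robots.txt so the route is actually reached):

# config/routes.rb
match '/robots.txt' => 'robots#show'

# app/controllers/robots_controller.rb
class RobotsController < ApplicationController
  def show
    # Disallow every notification URL created so far.
    rules = ["User-agent: *"]
    rules += NotificationLink.select(:hash_value).map { |l| "Disallow: /#{l.hash_value}" }
    render :text => rules.join("\n"), :content_type => "text/plain"
  end
end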

It would, of course, only hold back those bots that are polite enough to follow the rules of your robots.txt.

If your users copy and paste HTML, you could make use of the nofollow link relationship type:

<a href="http://example.com/this_is_a_hash" rel="nofollow">cute cat</a>

However, this would not be very effective, as even some of those search engines that support this link type still visit the pages.

Alternatively, you could require JavaScript to be able to click the link, but that’s not very elegant, of course.

But I assume they only copy and paste the plain URL, so this wouldn't work anyway.

So the only chance you have is to decide if it’s a bot or a human after the link got clicked.

You could check for user-agents. You could analyze the behaviour on the page (e.g. how long it takes until the first click). Or, if it's really important to you, you could force the users to enter a CAPTCHA before they can see the page content at all. Of course you can never catch all bots with such methods.
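
A user-agent check could be done right in the controller, before the mail goes out. This is only a sketch for Rails 3; the controller, model and mailer names are assumptions, and crawlers can fake their user-agent, so this only filters the honest ones:

# app/controllers/links_controller.rb
class LinksController < ApplicationController
  # Patterns commonly found in crawler user-agents; extend as needed.
  BOT_PATTERN = /bot|crawl|spider|slurp|facebookexternalhit|embedly/i

  def show
    @link = Link.find_by_hash_value!(params[:hash])

    # Only notify the owner when the request does not look like a crawler.
    unless request.user_agent.to_s =~ BOT_PATTERN
      NotificationMailer.visit_notification(@link).deliver
    end
  end
end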

You could use an analytics tool on the pages, like Piwik. Such tools try to differentiate users from bots, so that only users show up in the statistics. I'm sure most analytics tools provide an API that would allow sending out mails for each registered visit.
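
For example, with Piwik you could periodically poll its Live.getLastVisitsDetails reporting method and notify the owner for each visit it registered. A rough Ruby sketch, assuming the usual JSON response fields (the URL, token and hash format are placeholders, and skipping visits you already mailed about is left out):

require 'net/http'
require 'uri'
require 'json'

PIWIK_URL  = 'http://piwik.example.com/index.php'  # your Piwik installation
TOKEN_AUTH = 'YOUR_TOKEN_AUTH'                     # your Piwik API token

params = {
  'module'     => 'API',
  'method'     => 'Live.getLastVisitsDetails',
  'idSite'     => '1',
  'period'     => 'day',
  'date'       => 'today',
  'format'     => 'JSON',
  'token_auth' => TOKEN_AUTH
}
uri = URI(PIWIK_URL)
uri.query = URI.encode_www_form(params)

JSON.parse(Net::HTTP.get(uri)).each do |visit|
  visit['actionDetails'].to_a.each do |action|
    if action['url'] =~ %r{/([A-Za-z0-9]+)\z}  # assumed hash format
      hash = $1
      # Replace with your mailer, e.g. NotificationMailer.visit_notification_for(hash).deliver
      puts "Notify owner of #{hash}"
    end
  end
end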

unor