
To support remote jQuery templating, I have some links appearing in my JavaScript, like this:

<script type="text/javascript">
var catalog = {};
catalog['key1'] = 'somepath/template_1.html';
catalog['key2'] = 'anotherpath/template_2.html';
//and so on
</script>

Now crawlers are trying to follow those links. How can I prevent this?

MatteoSp
  • you can try to encrypt those links, or just use a simple scrambling algorithm, nothing complex – Hugo Alves Feb 22 '13 at 09:52
  • you can add a robots.txt file to block crawlers from certain links: http://www.robotstxt.org/robotstxt.html – Pete Feb 22 '13 at 09:52
  • Pete, robots.txt is not a solution for 2 reasons. First, the URLs aren't absolute, they are relative. Second, there's no guarantee crawlers honor robots.txt. – MatteoSp Feb 22 '13 at 10:47

2 Answers


First and foremost: which crawlers are trying to access those paths? Are they popular (e.g. Google Bot, Bing Bot, Yahoo! Slurp) or some other bots? Your best bet is to identify which crawlers are the "offenders" and then try to figure out why they're following those links. It's very difficult to tell you how to prevent this without making a bunch of assumptions.

Read on to see just how many assumptions can be made:

Suppose that there are two types of crawlers out there:

  1. Smart ones: they don't look for URLs in JavaScript, because it's very inefficient and it may result in pointless attempts to crawl things that are complete nonsense (such as http://link.to.other/javascript/stuff.js). However, these crawlers may be executing the JavaScript.
  2. Dumb ones: they may get the HTML content and apply a regex to extract all URLs. Most of the time such crawlers are very likely not even executing your JavaScript.

Having JavaScript execution capability in a crawler is quite complicated, so I would think that only very few crawlers out there have such a capability, and if they do, they're professional-grade crawlers. If they're professional-grade crawlers, then you can expect them to honor robots.txt as well as things like "nofollow" in an anchor element's rel attribute:

<a href="http://www.example.com/" rel="nofollow">Link text</a>

I would bucket those in the "smart" crawler group. Most of the popular bots are pretty smart and they're also polite so you don't have to worry about them so much.

Does the JavaScript modify the document in a way that results in a hyperlink of some sort? If so, a smart crawler can pick up the link, but a dumb crawler won't, because it's much less likely to execute the JavaScript.
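
For clarity, this is the kind of document modification that question refers to, as a hypothetical example (the #nav selector is made up; the catalog comes from the question): once a path is written into an href, it becomes a real link in the DOM that a JavaScript-executing crawler can follow.

<script type="text/javascript">
// Hypothetical: appending the template path as an anchor puts a real,
// followable link into the document for any crawler that runs JavaScript.
$('#nav').append('<a href="' + catalog['key1'] + '">template 1</a>');
</script>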

So what can you do then? Well, for smart crawlers you should apply all of the standard politeness policies: robots.txt, "nofollow", etc. Most of the time that should be sufficient to prevent them from crawling those links. You want to be nice to them anyway, since they're probably helpful to your site (i.e. they're going to drive traffic to it based on your content).
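
For instance, a minimal robots.txt sketch along those lines, assuming the templates live under the somepath/ and anotherpath/ directories from the question (adjust the paths to wherever they actually sit relative to the site root):

User-agent: *
Disallow: /somepath/
Disallow: /anotherpath/

Polite crawlers will skip those directories entirely; as noted in the comments, though, nothing forces a crawler to honor this.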

For the dumb crawlers you might have to test out a few different solutions: obfuscate the URL or employ one of several strategies to detect them. You can do all kinds of things once you detect them, some are nice, some are not so nice :).
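
As a sketch of the obfuscation route (essentially what Hugo Alves suggested in the comments), assuming Base64 is enough to defeat a regex-based extractor; the encoded values and the templateUrl helper are illustrative, not something your code must match:

<script type="text/javascript">
var catalog = {};
// Base64-encoded paths: a regex scan of the page source finds nothing URL-shaped.
// Generate the values with btoa('somepath/template_1.html'), etc.
catalog['key1'] = 'c29tZXBhdGgvdGVtcGxhdGVfMS5odG1s'; // somepath/template_1.html
catalog['key2'] = 'YW5vdGhlcnBhdGgvdGVtcGxhdGVfMi5odG1s'; // anotherpath/template_2.html

// Decode only at the moment a template is actually requested.
function templateUrl(key) {
  return atob(catalog[key]);
}
</script>

A crawler that executes JavaScript can still recover the paths, but those are the "smart" ones the politeness policies above already cover.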

Again, you can see that without further information, we have to make A LOT of assumptions. So you should either provide us more information or at least try to analyze the information yourself and keep the above questions/ideas in mind.

Kiril
  • thanks for the analysis. The easiest thing to do was to follow what Hugo Alves suggested in his comment. – MatteoSp Feb 26 '13 at 13:29
  • @MatteoSp I don't know the details of the crawlers you're encountering, but if that's what works for you then 1 UP! :) – Kiril Feb 26 '13 at 15:24

Make it look less like links:

var catalog = {
  'key1': { 'path': 'somepath',    'page': 'template_1.html' },
  'key2': { 'path': 'anotherpath', 'page': 'template_2.html' }
  // and so on
};
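
For completeness, a hypothetical helper showing how the pieces would be joined only at request time, so the full path never appears as a single string in the markup (the $.get call assumes the templates are fetched with jQuery, which the question already uses):

function templatePath(key) {
  // The complete path only ever exists here, at runtime.
  return catalog[key].path + '/' + catalog[key].page;
}

$.get(templatePath('key1'), function (html) {
  // render the remote template from the fetched markup
});
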
mplungjan