
Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLs against it to see whether they would be excluded or not?

Long question:

I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes: a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I have to assume that someone might publish the /oprah link on their blog or somewhere else. We don't actually want this 'mini-Oprah site' to be indexed, because that would let non-Oprah viewers find the special Oprah offers.

So at the same time I was creating the sitemap, I also added URLs such as /oprah to our robots.txt file so they would be excluded.
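The relevant robots.txt entries look something like this (the path is just an example):

    User-agent: *
    Disallow: /oprah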

Then (and this is the actual question) I thought, 'wouldn't it be nice to be able to show on the sitemap whether or not files are indexed and visible to robots?' This would be quite simple - just parse robots.txt and then evaluate a link against it.

However, this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex) - so I was wondering if anyone has already written any code to parse robots.txt?
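To be clear about what I'm picturing, something as small as the sketch below would do. This is just me guessing at the shape of it, not existing code - SimpleRobotsRules and its methods are made-up names, and it ignores Allow directives, wildcards and per-agent sections other than *:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Minimal matcher: only handles "User-agent: *" blocks and plain
    // "Disallow:" prefix rules (no Allow, no wildcards, no crawl-delay).
    public class SimpleRobotsRules
    {
        private readonly List<string> _disallowedPrefixes = new List<string>();

        public static SimpleRobotsRules Parse(string robotsTxtContent)
        {
            var rules = new SimpleRobotsRules();
            bool inStarSection = false;

            foreach (var rawLine in robotsTxtContent.Split('\n'))
            {
                // Strip comments and surrounding whitespace.
                var line = rawLine.Split('#')[0].Trim();
                if (line.Length == 0) continue;

                if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                {
                    inStarSection = line.Substring("User-agent:".Length).Trim() == "*";
                }
                else if (inStarSection &&
                         line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                {
                    var path = line.Substring("Disallow:".Length).Trim();
                    if (path.Length > 0) rules._disallowedPrefixes.Add(path);
                }
            }
            return rules;
        }

        // True if the given path (e.g. "/oprah") is not blocked for all robots.
        public bool IsAllowed(string path)
        {
            return !_disallowedPrefixes.Any(prefix => path.StartsWith(prefix));
        }
    }

The sitemap page would then just do something like:

    var rules = SimpleRobotsRules.Parse(File.ReadAllText("robots.txt"));
    bool visible = rules.IsAllowed("/oprah");   // false if "Disallow: /oprah" is present

But I'd rather not reinvent (and then maintain) this if someone has already done it properly.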

Simon_Weaver

3 Answers


Hate to say it, but just Google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
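I don't have the exact class in front of me, so don't take the names below as the real Searcharoo API - it's just a sketch of the pattern those two points describe, with the parsing delegated to something like the SimpleRobotsRules matcher sketched in the question:

    using System;
    using System.Net;

    // Illustrative only - not the actual Searcharoo.Indexer.RobotsTxt code.
    public class RobotsTxtChecker
    {
        private readonly SimpleRobotsRules _rules;   // matcher from the question's sketch

        public RobotsTxtChecker(Uri siteRoot)
        {
            string content;
            using (var client = new WebClient())
            {
                try
                {
                    // 1. Check for, and if present, download the site's robots.txt.
                    content = client.DownloadString(new Uri(siteRoot, "/robots.txt"));
                }
                catch (WebException)
                {
                    content = string.Empty;   // no robots.txt means nothing is excluded
                }
            }
            _rules = SimpleRobotsRules.Parse(content);
        }

        // 2. Interface for the spider (or your sitemap page) to check each URL.
        public bool Allowed(Uri url)
        {
            return _rules.IsAllowed(url.AbsolutePath);
        }
    }

Construct it once per site, then call Allowed() for every URL you want to annotate on the sitemap.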
realMarkusSchmidt
  • Oops, I'll admit I didn't search Google this time. Ironically, this question is now the first match for 'c# robots.txt' :-) I'll see if I can extract what I need from that. Thanks. – Simon_Weaver Mar 11 '09 at 08:31
  • I hope you're not stuck in an infinite loop now ;-) Funny, they even show exactly the Google part of my answer as the preview text. I didn't realize Google has become so fast by now even for non-news sites, very interesting. – realMarkusSchmidt Mar 11 '09 at 09:49
  • Am I falling into the loop? :) – Velcro Nov 02 '17 at 15:52

I like the code and tests in http://code.google.com/p/robotstxt/ and would recommend it as a starting point.

Sam Saffron

A bit of self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:

http://nrobots.codeplex.com/

I'd love any feedback.

SaguiItay