
Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLs against it to see whether they would be excluded or not?

Long question:

I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes: a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I have to assume that someone might publish the /oprah link on their blog or somewhere else. We don't actually want this 'mini-Oprah site' to be indexed, because that would let non-Oprah viewers find the special Oprah offers.

So at the same time I was creating the sitemap, I also added URLs such as /oprah to our robots.txt file so they would be excluded.
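The relevant robots.txt entries look something like this (the path is just an example):

    User-agent: *
    Disallow: /oprah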

Then (and this is the actual question) I thought, 'wouldn't it be nice to be able to show on the sitemap whether or not files are indexed and visible to robots?' This would be quite simple - just parse robots.txt and then evaluate a link against it.

However, this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex) - so I was wondering if anyone has already written any code to parse robots.txt?
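To be clear about what I'm picturing, something as small as the sketch below would do. This is just me guessing at the shape of it, not existing code - SimpleRobotsRules and its methods are made-up names, and it ignores Allow directives, wildcards and per-agent sections other than *:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Minimal matcher: only handles "User-agent: *" blocks and plain
    // "Disallow:" prefix rules (no Allow, no wildcards, no crawl-delay).
    public class SimpleRobotsRules
    {
        private readonly List<string> _disallowedPrefixes = new List<string>();

        public static SimpleRobotsRules Parse(string robotsTxtContent)
        {
            var rules = new SimpleRobotsRules();
            bool inStarSection = false;

            foreach (var rawLine in robotsTxtContent.Split('\n'))
            {
                // Strip comments and surrounding whitespace.
                var line = rawLine.Split('#')[0].Trim();
                if (line.Length == 0) continue;

                if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                {
                    inStarSection = line.Substring("User-agent:".Length).Trim() == "*";
                }
                else if (inStarSection &&
                         line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                {
                    var path = line.Substring("Disallow:".Length).Trim();
                    if (path.Length > 0) rules._disallowedPrefixes.Add(path);
                }
            }
            return rules;
        }

        // True if the given path (e.g. "/oprah") is not blocked for all robots.
        public bool IsAllowed(string path)
        {
            return !_disallowedPrefixes.Any(prefix => path.StartsWith(prefix));
        }
    }

The sitemap page would then just do something like:

    var rules = SimpleRobotsRules.Parse(File.ReadAllText("robots.txt"));
    bool visible = rules.IsAllowed("/oprah");   // false if "Disallow: /oprah" is present

But I'd rather not reinvent (and then maintain) this if someone has already done it properly.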

Simon_Weaver

3 Answers


Hate to say it, but just Google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
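I don't have the exact class in front of me, so don't take the names below as the real Searcharoo API - it's just a sketch of the pattern those two points describe, with the parsing delegated to something like the SimpleRobotsRules matcher sketched in the question:

    using System;
    using System.Net;

    // Illustrative only - not the actual Searcharoo.Indexer.RobotsTxt code.
    public class RobotsTxtChecker
    {
        private readonly SimpleRobotsRules _rules;   // matcher from the question's sketch

        public RobotsTxtChecker(Uri siteRoot)
        {
            string content;
            using (var client = new WebClient())
            {
                try
                {
                    // 1. Check for, and if present, download the site's robots.txt.
                    content = client.DownloadString(new Uri(siteRoot, "/robots.txt"));
                }
                catch (WebException)
                {
                    content = string.Empty;   // no robots.txt means nothing is excluded
                }
            }
            _rules = SimpleRobotsRules.Parse(content);
        }

        // 2. Interface for the spider (or your sitemap page) to check each URL.
        public bool Allowed(Uri url)
        {
            return _rules.IsAllowed(url.AbsolutePath);
        }
    }

Construct it once per site, then call Allowed() for every URL you want to annotate on the sitemap.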
realMarkusSchmidt
  • Oops, I'll admit I didn't search Google this time. Ironically, this question is now the first match for 'c# robots.txt' :-) I'll see if I can extract what I need from that. Thanks. – Simon_Weaver Mar 11 '09 at 08:31
  • I hope you're not stuck in an infinite loop now ;-) Funny, they even show exactly the Google part of my answer as the preview text. I didn't realize Google has become so fast by now even for non-news sites, very interesting. – realMarkusSchmidt Mar 11 '09 at 09:49
  • Am I falling into the loop? :) – Velcro Nov 02 '17 at 15:52

I like the code and tests in http://code.google.com/p/robotstxt/ and would recommend it as a starting point.

Sam Saffron

A bit of self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:

http://nrobots.codeplex.com/

I'd love any feedback.

SaguiItay