I am building a shopping comparison engine and need a crawling engine to perform the daily data collection.
I have decided to build the crawler in C#. I have had a lot of bad experiences with the HttpWebRequest/HttpWebResponse classes: I have found them buggy and unstable for large crawls, even in .NET Framework 4.0, so I have decided NOT to build on them.
I speak from my own personal experience.
I would like opinions from experts here who have written crawlers: do you know of any good open source crawling frameworks for C#, comparable to Nutch and Apache Commons in Java, which are very stable and robust libraries?
If such crawling frameworks already exist in C#, I will go ahead and build my application on top of them.
If not, I am planning to extend this solution from Code Project:
http://www.codeproject.com/KB/IP/Crawler.aspx
If anyone can suggest a better path, I would be really thankful.
EDIT: Some of the sites I have to crawl render their pages using very complex JavaScript. This adds complexity to my crawler, since I need to be able to crawl JavaScript-rendered pages. If someone has used a C# library that can crawl JavaScript-rendered pages, please share. I have used WatiN, which I don't prefer, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.