Any Good Open Source Web Crawling Framework in C#

Question

I am building a shopping comparison engine, and I need a crawling engine to perform the daily data collection.

I have decided to build the crawler in C#. I have had a lot of bad experiences with the HttpWebRequest/HttpWebResponse classes, and they are known to be highly buggy and unstable for large crawls, even in .NET Framework 4.0. So I have decided NOT to build on them.

I speak from personal experience.

I would like opinions from experts here who have written crawlers: do you know of any good open source crawling frameworks for C#, comparable to Nutch and Apache Commons in Java, which are very stable and robust libraries?

If there are existing crawling frameworks in C#, I will build my application on top of one of them.

If not, I am planning to extend this solution from CodeProject:

http://www.codeproject.com/KB/IP/Crawler.aspx

If anyone can suggest a better path, I would be really thankful.

EDIT: Some of the sites I have to crawl render their pages with very complex JavaScript. This adds complexity to my crawler, since it needs to handle pages rendered by JavaScript. If you have used a C# library that can crawl JavaScript-rendered pages, please share. I have used WatiN, which I don't prefer, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.

@SLaks, HttpWebRequest and the CookieCollection class mess up my sessions when I crawl websites that need logins. I had to add cookies to the CookieCollection object individually to make it work as it is supposed to. There are several such examples. - Sumit Ghosh, Dec 05 '10 at 18:25
While doing my research I found one open source solution: http://arachnode.net/ Has anyone here used it before? Any reviews? - Sumit Ghosh, Dec 05 '10 at 18:33
@Sumit: No such issue exists. If you're having trouble, ask a separate question. - SLaks, Dec 05 '10 at 21:48
@SLaks, do you work for MS? It looks like it. The bugs do exist, and it's not only me; a whole community will vouch for that. MS has stupidly coded a lot of the session handling in HttpWebRequest. - Sumit Ghosh, Dec 07 '10 at 07:55
`HttpWebRequest` *et al.* are buggy and unstable for large crawls? I guess I should stop using them, then, for crawling more than 50 million web pages per day? - Jim Mischel, Feb 14 '11 at 21:29
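
For readers following the cookie discussion in the comments: a minimal sketch of sharing one `CookieContainer` across requests, so that a session cookie travels with every request made through it. The class name `CookieSketch`, its two helper methods, and the cookie name and value used in the usage note are hypothetical and chosen for illustration only; `CookieContainer`, `Cookie`, and `HttpWebRequest.CookieContainer` are standard `System.Net` members. It does not settle whether the behavior described above is a bug.

```csharp
using System;
using System.Net;

public static class CookieSketch
{
    // Builds a container holding one session cookie for the given site.
    // The path "/" makes the cookie apply to every URL on that host.
    public static CookieContainer CreateSessionCookies(Uri site, string name, string value)
    {
        var cookies = new CookieContainer();
        cookies.Add(site, new Cookie(name, value, "/"));
        return cookies;
    }

    // Creates a request that uses the shared container. With the container
    // attached, cookies set by responses are stored in it as well, so later
    // requests that reuse the same container send them back.
    public static HttpWebRequest CreateRequest(Uri url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        return request;
    }
}
```

Typical use would be to create the container once, for example with `CookieSketch.CreateSessionCookies(new Uri("https://example.com/"), "session", "abc123")`, and pass the same instance to every `CreateRequest` call in the crawl.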

score 3 · Answer 1 · answered Mar 20 '13 at 21:22

Abot C# Web Crawler

The description from http://code.google.com/p/abot/ says: Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low level plumbing (multithreading, http requests, scheduling, link parsing, etc.). You just hook into key events to process data or plug in your own implementations of core interfaces to take complete control over the crawl process.

I haven't used it, though.

score 3 · Accepted Answer · answered Feb 11 '15 at 16:36

PhantomJS + HtmlAgilityPack

I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.

This example just uses PhantomJS's built-in parsing capabilities. It worked with a very old version of the library; since the library still seems to be under active development, it is safe to assume that more capabilities have been added.

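using System;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

void Test()
{
    var linkText = "Help Spread DuckDuckGo!";
    // GoToUrl needs an absolute URL, including the scheme.
    Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
    // at the time of writing, this printed 'https://duckduckgo.com/spread'
}

/// <summary>
/// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
/// its URL if found, otherwise an empty string.
/// </summary>
public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
{
    using (IWebDriver phantom = new PhantomJSDriver())
    {
        // Navigate is a method on IWebDriver, not a property.
        phantom.Navigate().GoToUrl(pageUrl);
        // FindElement throws NoSuchElementException when nothing matches,
        // so use FindElements and take the first hit, if any.
        var link = phantom.FindElements(By.PartialLinkText(searchLinkText)).FirstOrDefault();
        if (link != null)
            return link.GetAttribute("href");
    }
    return string.Empty;
}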
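
The example above only uses the driver's own element lookup. As a rough sketch of the PhantomJS + HtmlAgilityPack combination mentioned earlier, the rendered page source can be handed to HtmlAgilityPack and queried with XPath. The class `RenderedPageLinks` and its methods `ExtractLinks` and `FetchLinks` are hypothetical names made up for this illustration, and the code assumes the same Selenium PhantomJS driver package as above plus the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

public static class RenderedPageLinks
{
    // Parses an HTML string and returns the href value of every anchor
    // that has a non-empty href attribute, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return new List<string>();

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToList();
    }

    // Lets PhantomJS load the page and run its scripts, then parses the
    // resulting markup with HtmlAgilityPack.
    public static List<string> FetchLinks(string pageUrl)
    {
        using (IWebDriver phantom = new PhantomJSDriver())
        {
            phantom.Navigate().GoToUrl(pageUrl);
            return ExtractLinks(phantom.PageSource);
        }
    }
}
```

Keeping the parsing step separate from the browser step means `ExtractLinks` can also be run on HTML fetched by any other means when JavaScript rendering is not needed.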

score 2 · Answer 3 · answered Feb 12 '13 at 17:57

arachnode.net can process JavaScript.

arachnode.net


score 2 · Answer 4 · answered Dec 05 '10 at 19:43

I know of something called NCrawler, available on CodePlex. I have not used it personally, but a colleague says it works OK.

Rikalous


score 0 · Answer 5 · answered Feb 14 '11 at 20:35

NCrawler does not support JavaScript, but it looks like a very good, easy-to-use solution if you don't need JavaScript execution.

John


score 0 · Answer 6 · answered Nov 04 '22 at 10:17

I understand this topic is very old, but I built a library for writing crawlers quickly, and it may be useful to someone else. The package name is

Laraue.Crawling.Dynamic.PuppeterSharp

The main idea is that you first describe the model you want to receive:

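public class User
{
    // Properties must be public so the schema lambdas (x => x.Name) can access them.
    public string Name { get; set; }
    public int Age { get; set; }
    public string[] ImageLinks { get; set; }
}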

Then describe how to fill its values:

var schema = new PuppeterSharpSchemaBuilder<User>()
    .HasProperty(x => x.Name, ".name")
    .HasProperty(x => x.Age, ".age")
    .HasArrayProperty(
        x => x.ImageLinks,
        ".links a",
        async handle => await handle.GetAttributeValueAsync("href"))
    .Build();

This schema can then be used to parse a page. The library uses the PuppeteerSharp package internally:

// Download browser and open the page
await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions());
var page = await browser.NewPageAsync();
var response = await page.GoToAsync(link);

// Parse the page using described schema
var parser = new PuppeterSharpParser(new LoggerFactory());
var model = await parser.RunAsync(schema, await page.QuerySelectorAsync("body"));

When JS rendering is not required, the library also supports static crawling via the AngleSharp library, through the package

Laraue.Crawling.Static.AngleSharp

The schema is described the same way.
