
I'm currently making a web crawler in C#, and I have a method that receives HTML strings, extracts the links from them and inserts those links into the list of all the links captured so far.

Since it is multi-threaded, I have used locks to prevent the list of all the links from being accessed by several different threads at the same time.

Which is better to do with locks?

This:

void ProcessHTML(string HTML)
{
    List<string> Links = GetLinks(HTML);
    for (int i = 0; i < Links.Count; i++)
    {
        lock (WebsitesHash)
        {
             lock (AllLinks)
             {
                  if (!WebsitesHash.ContainsKey(Links[i]))
                  {
                       WebsitesHash[Links[i]] = true;
                       AllLinks.Add(Links[i]);                    
                  }
             }
        }
    }
}

Or this:

void ProcessHTML(string HTML)
{
    List<string> Links = GetLinks(HTML);
    lock (WebsitesHash)
    {
        lock (AllLinks)
        {
             for (int i = 0; i < Links.Count; i++)
             {
                  if (!WebsitesHash.ContainsKey(Links[i]))
                  {
                       WebsitesHash[Links[i]] = true;
                       AllLinks.Add(Links[i]);
                  }
             }
        }
    }
}

Which is generally considered better - locking on each iteration, or locking once around all the iterations?

Other code that might be relevant:

void StartCrawl(string Seed)
{
    AllLinks.Capacity = 1000 * 1000 * 10;
    StreamWriter Log = new StreamWriter(File.Open("Websites.txt", FileMode.Append));
    string HTML = GetHTML(Seed);
    ProcessHTML(HTML);
    for (int i = 0; i < AllLinks.Count; i++)
    {
        if (!Work)
        {
             Log.Close();
             WebsitesHash = new Dictionary<string, bool>();
             break;
        }
        Log.WriteLine(AllLinks[i]);
        websText.Text = AllLinks.Count + "";
        try { HTML = GetHTML(AllLinks[i]); }
        catch { continue; }
        Thread Parser = new Thread(() => ProcessHTML(HTML));
        Parser.Start();
    }
}
  • I would use ConcurrentDictionary instead. – DixonD Jan 31 '15 at 12:38
  • There is very little need to do this with multiple threads. The act of performing IO (presumably over the internet) is tens of thousands of times slower than the parsing of the pages. You would get very little benefit using multiple threads. – Enigmativity Jan 31 '15 at 12:40
  • @DixonD - Thank you, but it won't be good enough because I need a list to iterate on with numerical indexes. Web crawlers are actually supposed to be built with recursive crawling; I just found another method of appending new links to the end of the list currently being iterated, to avoid recursion. This is why I cannot switch it to a ConcurrentDictionary. Even if I make one which is, I need another HashTable to avoid crawling one website multiple times. – BlueRay101 Jan 31 '15 at 12:47
  • @Enigmativity - I know web requests are slow and can take anywhere from 50-1000 ms (even more in some cases), and that this is much slower than the parsing of the links. Although that is true, the parsing of the links involves loops with IndexOf, Contains and Replace operations (on strings), which can be quite slow when repeated so many times. I just didn't want the parsing to affect the crawling speed, but I think you're right. Anyway, even if I decide to stick with multiple threads, does anyone have any recommendation about the locks? – BlueRay101 Jan 31 '15 at 12:49

2 Answers


In this case, it won't matter a whole lot.

The links are retrieved outside the lock, so the only work inside it is adding a handful of strings to a list. That is very little, so the question is largely moot.

If the amount of work were larger, locking inside the loop would be preferable, so that other threads are not kept waiting for the whole batch.

And although locks are cheap, you can optimize a little by locking just once. You can also use a dedicated private object lockObject = new object(); to make the locking protocol clearer.
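For example, a minimal sketch of the locked-once variant with a single dedicated lock object (the field and method names mirror the question's; lockObject itself is an assumed addition):

private readonly object lockObject = new object(); // guards WebsitesHash and AllLinks

void ProcessHTML(string HTML)
{
    List<string> Links = GetLinks(HTML);

    // Take the lock once per page rather than once per link.
    lock (lockObject)
    {
        for (int i = 0; i < Links.Count; i++)
        {
            if (!WebsitesHash.ContainsKey(Links[i]))
            {
                WebsitesHash[Links[i]] = true;
                AllLinks.Add(Links[i]);
            }
        }
    }
}

The point of the dedicated object is that every piece of code touching the two collections locks on the same thing, and nothing outside the class can accidentally take that lock.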

H H

Say AllLinks is the global storage of links:

public List<string> AllLinks = new List<string>();

Use the List.BinarySearch method wherever you add a new link:

// "link" contain string of html link
lock(AllLinks)
{
    int index = AllLinks.BinarySearch(link);
    if( index < 0 )
    { // link is not in AllLinks
        AllLinks.Add(~index, link);
    }
    else
    { // link exist, "index" contain its position in list
        // ...
    }
}

I think the WebsitesHash object is then not necessary.

Update: an additional advantage of using BinarySearch is that AllLinks stays sorted.
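For illustration, a minimal sketch of ProcessHTML built around this idea (names follow the question; it assumes AllLinks is kept sorted from the start):

void ProcessHTML(string HTML)
{
    List<string> Links = GetLinks(HTML);
    lock (AllLinks)
    {
        foreach (string link in Links)
        {
            // BinarySearch returns the bitwise complement of the insertion
            // point when the item is not found, so ~index gives the slot.
            int index = AllLinks.BinarySearch(link);
            if (index < 0)
            {
                AllLinks.Insert(~index, link);
            }
        }
    }
}

Keep in mind that List<T>.Insert shifts every element after the insertion point, so adding in sorted order costs O(n) per link, whereas the dictionary lookup in the original code is O(1); with millions of links that difference can matter.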

  • Wow, thanks a lot! I had no idea List had such a method. Absolutely better than using a HashTable, thanks again. – BlueRay101 Jan 31 '15 at 14:48