5

I need to crawl parent web pages and their child web pages, and I followed the producer/consumer concept from http://www.albahari.com/threading/part4.aspx#%5FWait%5Fand%5FPulse. I also used 5 threads, which enqueue and dequeue links.

Any recommendations on how I can end/join all the threads once they have finished processing the queue, given that the length of the queue is unknown?

Below is the idea of how I coded it:

static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void Crawl()
{
   //dequeue
   //get child links
   //enqueue child links
}
user611333

4 Answers

3

If all of your threads are idle (i.e. waiting on the queue) and the queue is empty, then you're done.

An easy way to handle that is to have the threads use a timeout when they're trying to access the queue, something like BlockingCollection<T>.TryTake. Whenever TryTake times out, the thread updates a field recording how long it's been idle:

// Note: this TryTake overload throws OperationCanceledException if the
// token is canceled while waiting, so callers should be prepared to catch it.
while (!queue.TryTake(out item, 5000, token))
{
    if (token.IsCancellationRequested)
        break;
    // the take timed out: update this thread's idle counter here
}

You can then have a timer that executes every 15 seconds or so to check all of the threads' idle counters. If all threads have been idle for some period of time (a minute, perhaps), then the timer can set the cancellation token. That will kill all the threads. Your main program, too, can be monitoring the cancellation token.
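
A minimal sketch of that watchdog idea, assuming a BlockingCollection<string> of URLs; the field names (lastActive, cts), the seed URL, and the one-minute threshold are illustrative choices, not part of the answer:

using System;
using System.Collections.Concurrent;
using System.Threading;

class IdleWatchdogSketch
{
    const int ThreadCount = 5;
    static readonly BlockingCollection<string> queue = new BlockingCollection<string>();
    static readonly CancellationTokenSource cts = new CancellationTokenSource();
    // lastActive[i] holds the ticks of worker i's last successful take
    static readonly long[] lastActive = new long[ThreadCount];

    static void Main()
    {
        for (int i = 0; i < ThreadCount; i++)
            lastActive[i] = DateTime.UtcNow.Ticks;
        queue.Add("http://example.com"); // hypothetical seed: the parent link(s)

        var workers = new Thread[ThreadCount];
        for (int i = 0; i < ThreadCount; i++)
        {
            int id = i; // capture a fresh copy for the closure
            workers[i] = new Thread(() => Worker(id));
            workers[i].Start();
        }

        // every 15 seconds, see whether all workers have gone idle
        using (new Timer(CheckIdle, null, 15000, 15000))
            foreach (var w in workers)
                w.Join(); // all workers exit once the token is canceled
    }

    static void Worker(int id)
    {
        try
        {
            while (true)
            {
                string url;
                if (queue.TryTake(out url, 5000, cts.Token))
                {
                    Interlocked.Exchange(ref lastActive[id], DateTime.UtcNow.Ticks);
                    // crawl 'url' and queue.Add(...) any child links found
                }
                // on a timeout just loop; the stale timestamp marks this worker idle
            }
        }
        catch (OperationCanceledException)
        {
            // the watchdog canceled the token: everyone was idle, so exit
        }
    }

    static void CheckIdle(object state)
    {
        long cutoff = DateTime.UtcNow.Ticks - TimeSpan.FromMinutes(1).Ticks;
        for (int i = 0; i < ThreadCount; i++)
            if (Interlocked.Read(ref lastActive[i]) > cutoff)
                return; // at least one worker was recently active
        cts.Cancel();   // all workers idle for a minute: signal shutdown
    }
}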

You can do this without BlockingCollection and cancellation, by the way. You'll just have to create your own cancellation signaling mechanism, and if you're using a lock on the queue, you can replace the lock syntax with Monitor.TryEnter, etc.
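
A rough sketch of that manual variant, using Monitor.Wait with a timeout on a plain Queue<string>; the class and the stopping flag are made-up names for illustration:

using System.Collections.Generic;
using System.Threading;

class ManualCrawlQueue
{
    readonly object sync = new object();
    readonly Queue<string> queue = new Queue<string>();
    bool stopping; // home-grown cancellation flag, set under the lock

    // Returns null on shutdown; the caller updates its idle counter on timeouts.
    public string TakeWithTimeout(int millis)
    {
        lock (sync)
        {
            while (queue.Count == 0)
            {
                if (stopping)
                    return null;
                if (!Monitor.Wait(sync, millis))
                {
                    // timed out with nothing to take:
                    // record this thread as idle here
                }
            }
            return queue.Dequeue();
        }
    }

    public void Add(string item)
    {
        lock (sync)
        {
            queue.Enqueue(item);
            Monitor.Pulse(sync); // wake one waiting consumer
        }
    }

    public void Stop()
    {
        lock (sync)
        {
            stopping = true;
            Monitor.PulseAll(sync); // wake all waiters so they can exit
        }
    }
}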

There are several other ways to handle this, although they would require some major restructuring of your program.

Jim Mischel
1

You can enqueue a dummy token at the end and have the threads exit when they encounter this token. Like:

// fields shared by all crawler threads: the queue, the thread count,
// and a counter of threads that have reported an empty queue
// Queue<string> queue; const int num_threads = 5; int report = 0;

public void Crawl()
{
    while (true)
    {
        if (queue.Count != 0)
        {
            if (report > 0) Interlocked.Decrement(ref report);

            string token = queue.Dequeue();
            if (token == "TERMINATION")
                return;                 // dummy token reached: this thread exits

            // get the child links of 'token' and enqueue them here
        }
        else
        {
            if (report == num_threads)  // all threads have signaled an empty queue
                queue.Enqueue("TERMINATION");
            else
                Interlocked.Increment(ref report); // this thread has found the queue empty
        }
    }
}

Of course, I have omitted the locks for enqueue/dequeue operations.

Tudor
  • I don't see where that's going to solve the problem. You have to know where the end is before you can queue the dummy token. – Jim Mischel Dec 12 '11 at 15:35
  • @Jim Mischel: Well there has to be a way to know, like no more child links to process. – Tudor Dec 12 '11 at 15:37
  • My point is that his original question was, in essence, "how do I know I'm at the end?" Your answer is, essentially, "when you're at the end, queue an end token." – Jim Mischel Dec 12 '11 at 15:48
  • Hmm, determining when the crawler will know there are no more links to process is the main bottleneck. Could setting a timer help? – user611333 Dec 12 '11 at 15:49
  • @user611333 assuming that you're crawling a finite number of pages, then you should eventually be able to figure out where the end is. If you want to stop crawling before the end of the queue, then the question you're asking is not really relevant to that case. – Kiril Dec 12 '11 at 16:03
  • @Tudor, good solution, however there is a flaw: suppose there are 5 threads and 100k links total distributed on multiple pages. If threads 1 through 4 are assigned to crawl pages with no links from the start, then they'll enqueue the termination token, while thread 5 finds 1 page with 10 links (which eventually lead to the other 999,995 links on the website) and queues them after threads 1 through 4 have queued the termination tokens; then thread 5 will have to crawl all of the remaining URLs by itself. – Kiril Dec 12 '11 at 16:10
  • I have hacked together a different solution. Please see the edit. – Tudor Dec 12 '11 at 16:18
  • @Lirik I think terminating immediately when there's nothing more to dequeue won't work, as the threads might still be in the middle of performing some actions, like saving to the db. (in reply to next comment) Hmm, I think I have already experienced something similar to this: before, I was enqueueing nulls as termination signs, making some threads end early while other threads were still busy digging down. – user611333 Dec 12 '11 at 16:24
  • @user611333 terminating when there is nothing else to dequeue is not based on time, it's based on availability. In other words, your threads can do all the stuff that they need to do with the database, but they'll eventually come back to the queue and either enqueue more work into it or dequeue work from it. Tudor's updated solution seems the most viable one since it doesn't care about time, it just cares if there is more data to be enqueued or not. – Kiril Dec 12 '11 at 16:36
0

The threads could signal that they have finished their work by raising an event, for example, or by calling a delegate.

static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void X()
{
    //block the threads until all of them are here
}

public void Crawl(Action x)
{
    //dequeue
    //get child links
    //enqueue child links
    //call x()
}
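
A minimal sketch of that signaling mechanism, using a CountdownEvent as the x callback; the five-thread count and the class name are assumptions for illustration:

using System;
using System.Threading;

class SignalSketch
{
    // one signal per worker; Main blocks until every worker has signaled
    static readonly CountdownEvent done = new CountdownEvent(5);

    static void Main()
    {
        for (int i = 0; i < 5; i++)
            new Thread(() => Crawl(() => done.Signal())).Start();

        done.Wait(); // this is the "block the threads until all of them are here" part
        Console.WriteLine("All crawler threads have finished.");
    }

    static void Crawl(Action x)
    {
        // dequeue, get child links, enqueue child links (omitted)
        x(); // tell the coordinator this thread is done
    }
}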
Ignacio Soler Garcia
  • Yes, that could work, but since the child links could also be parent links, the threads will not know exactly whether their work has already ended. – user611333 Dec 12 '11 at 15:49
0

There is really no need to handle the producer-consumer stuff manually if you are willing to use the Task Parallel Library. When you create tasks with the AttachedToParent option, the child tasks link to the parent task in such a manner that the parent will not complete until all of its children have completed.

using System;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var task = CrawlAsync("http://stackoverflow.com");
        task.Wait();
    }

    static Task CrawlAsync(string url)
    {
        return Task.Factory.StartNew(
            () =>
            {
                string[] children = ExtractChildren(url);
                foreach (string child in children)
                {
                    CrawlAsync(child);
                }
                ProcessUrl(url);
            }, TaskCreationOptions.AttachedToParent);
    }

    static string[] ExtractChildren(string root)
    {
        // Return all child urls here.
        return new string[0];
    }

    static void ProcessUrl(string url)
    {
        // Process the url here.
    }
}

You could remove some of the explicit task creation logic by using Parallel.ForEach.
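
For instance, a hypothetical synchronous variant of Crawl built on Parallel.ForEach (reusing the ExtractChildren and ProcessUrl placeholders above) could look like this:

// Parallel.ForEach blocks until the entire subtree rooted at 'url'
// has been crawled, so no explicit task bookkeeping is needed
static void Crawl(string url)
{
    string[] children = ExtractChildren(url);
    Parallel.ForEach(children, child => Crawl(child));
    ProcessUrl(url);
}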

Brian Gideon