
I am writing a multi-threaded program to scrape a certain site and collect IDs. It stores these IDs in a shared static List<string>.

Before any item is added to the List<string>, it is checked against a HashSet<string> containing a blacklist of already-collected IDs.

I do this as follows:

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();

public static void AddIDToIDList(string ID)
{
    lock (IDList)
    {
        if (IsIDBlacklisted(ID))
            return;
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    lock (Blacklist)
    {
        if (Blacklist.Contains(ID))
            return true;
    }
    return false;
}

The Blacklist is saved to a file after finishing and is loaded every time the program starts, so it will get pretty large over time (up to 50k records). Is there a more efficient way not only to store this blacklist, but also to check each ID against it?

Thanks!

doppelgreener
blizz
  • It should be fairly easy to test this by creating a large, fake Blacklist and using the System.Diagnostics.Stopwatch class to time operations. – Greg Aug 01 '13 at 03:54
  • Thanks but I'm not sure of a better alternative to test against, hence the question. – blizz Aug 01 '13 at 03:57
  • possible duplicate of [What .NET collection provides the fastest search](http://stackoverflow.com/questions/1009107/what-net-collection-provides-the-fastest-search) – Greg Aug 01 '13 at 04:03
  • When does `Blacklist` get modified? – Brian Gideon Aug 01 '13 at 04:15
  • Almost confused myself - this blacklist is not being modified until all operations have ceased. Then it saves to a text file which will be loaded next run. – blizz Aug 01 '13 at 04:47

4 Answers


To improve performance, try the ConcurrentBag<T> collection. There is also no need to lock Blacklist, because it is not being modified, e.g.:

private static HashSet<string> Blacklist = new HashSet<string>();
private static ConcurrentBag<string> IDList = new ConcurrentBag<string>();

public static void AddIDToIDList(string ID)
{
    if (Blacklist.Contains(ID))
    {
        return;
    }

    IDList.Add(ID);
}
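As raised in the comments, ConcurrentBag<T> does not reject duplicates. If the collected IDs themselves must stay unique, a ConcurrentDictionary<string, byte> keyed on the ID can act as a lock-free set. A sketch under that assumption (Collector and IDSet are illustrative names, not from the question):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

static class Collector
{
    // Safe for concurrent reads as long as it is never modified after loading.
    private static readonly HashSet<string> Blacklist =
        new HashSet<string> { "bad1" };

    // TryAdd returns false when the key is already present, so the
    // dictionary doubles as a thread-safe "no duplicates" set.
    private static readonly ConcurrentDictionary<string, byte> IDSet =
        new ConcurrentDictionary<string, byte>();

    public static bool AddIDToIDList(string id)
    {
        if (Blacklist.Contains(id))
            return false;           // blacklisted, skip
        return IDSet.TryAdd(id, 0); // false if already collected
    }
}
```

When scraping finishes, `IDSet.Keys` yields the collected IDs.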
Kirill Polishchuk
  • That's very interesting. This provides better performance than HashSet? Also, is it possible to not allow duplicates with ConcurrentBag? – blizz Aug 01 '13 at 04:51
  • So as I understand it, I must use sets to get a "no duplicates" rule, or write my own collection. If so, this does not achieve my purpose. – blizz Aug 01 '13 at 05:00
  • @blizz, This code has exactly the same behavior as yours, but better performance. – Kirill Polishchuk Aug 01 '13 at 05:36
  • @blizz Your current code doesn't enforce a "no duplicates" rule either. `List` is no set. – CodesInChaos Aug 01 '13 at 06:24
  • @CodesInChaos Yes, it does. It checks against the blacklist and doesn't add to the list if it is in there. If it does add the ID to the list, it also adds it to the blacklist (which isn't shown here). The blacklist uses a HashSet which allows no duplicates. – blizz Aug 01 '13 at 16:32
  • @blizz "it also adds it to the blacklist" which changes your question a lot and actually invalidates part of this answer. When asking a question, please include all relevant information. It also directly contradicts your "this blacklist is not being modified until all operations have ceased" claim. – CodesInChaos Aug 01 '13 at 16:59

Read operations on a HashSet are thread-safe, so as long as Blacklist is not being modified you don't need to lock on it. You should also perform the blacklist check before taking the lock, so the lock is held less often; this will further improve your performance.

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();

public static void AddIDToIDList(string ID)
{
    if (IsIDBlacklisted(ID))
        return;
    lock (IDList)
    {
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    return Blacklist.Contains(ID);
}

If Blacklist is being modified, the best way to lock around it is a ReaderWriterLock (use the slim ReaderWriterLockSlim version if you are on .NET 3.5 or later):

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();
private static ReaderWriterLockSlim BlacklistLock = new ReaderWriterLockSlim();

public static void AddIDToIDList(string ID)
{
    if (IsIDBlacklisted(ID))
        return;
    lock (IDList)
    {
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    BlacklistLock.EnterReadLock();
    try
    {
        return Blacklist.Contains(ID);
    }
    finally
    {
        BlacklistLock.ExitReadLock();
    }
}

public static bool AddToIDBlacklist(string ID)
{
    BlacklistLock.EnterWriteLock();
    try
    {
        return Blacklist.Add(ID);
    }
    finally
    {
        BlacklistLock.ExitWriteLock();
    }
}
Scott Chamberlain
  • What is the benefit of using ReaderWriterLockSlim over a static lock object? – blizz Aug 01 '13 at 05:01
  • ReaderWriterLock allows unlimited concurrent readers; it only acts like a lock when someone wants to write (then it blocks both writers and readers). – Scott Chamberlain Aug 01 '13 at 05:05
  • But `ReaderWriterLock(Slim)`'s per-call overhead is larger than that of `lock`, so chances are good that it's slower even in mostly-read scenarios. – CodesInChaos Aug 01 '13 at 06:24
  • @CodesInChaos Do you have anything to back that up? In a highly contended read-frequently, write-infrequently situation I would expect ReaderWriterLock to way outperform, as you never need to block your threads except on the rare write occasions. I could see `lock` being cheaper in single-threaded situations or situations with no contention, but then why are you locking if the resource is never going to be contended? – Scott Chamberlain Aug 01 '13 at 06:49
  • @ScottChamberlain I don't remember the source and it might be outdated. But what I remember is that for fine-grained locking `lock` is typically better, since its overhead is smaller. For coarse-grained locking, `ReaderWriterLockSlim` pulls ahead if it's a mostly-read scenario. **Benchmark your specific scenario with both kinds of locks.** Don't simply assume one of them is faster. – CodesInChaos Aug 01 '13 at 06:52
  • @CodesInChaos On that I totally agree: **don't take the advice from random people on the internet blindly; benchmark and see what works best in the situation you are in!** – Scott Chamberlain Aug 01 '13 at 06:53

Two considerations. First, if you use the indexer of a .NET dictionary (i.e., System.Collections.Generic.Dictionary), rather than calling the Add() method, like this:

idList[id] = id;

then it will add the item if it doesn't already exist; otherwise, it will replace the existing item at that key. Second, you can use ConcurrentDictionary (in the System.Collections.Concurrent namespace) for thread safety, so you don't have to worry about the locking yourself. The same comment applies about using the indexer.
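A minimal illustration of the indexer's add-or-replace behavior on ConcurrentDictionary (idList is an illustrative name):

```csharp
using System;
using System.Collections.Concurrent;

class IndexerDemo
{
    static void Main()
    {
        var idList = new ConcurrentDictionary<string, string>();

        idList["abc123"] = "abc123"; // adds the entry
        idList["abc123"] = "abc123"; // replaces it; no exception, no duplicate

        Console.WriteLine(idList.Count); // 1
    }
}
```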

Steve Michelotti
  • Interesting note - +1. Unfortunately, it doesn't answer my question about efficiency. – blizz Aug 01 '13 at 05:02
  • Actually, I was saying this is *more* efficient because you can just call the indexer with 1 line of code rather than doing a Contains() check followed by an Add(). – Steve Michelotti Aug 01 '13 at 05:06

In your scenario, yes, HashSet is the best option for this, since it stores a single value per entry to look up, unlike a Dictionary, which requires both a key and a value.

And of course, as others have said, there is no need to lock the HashSet if it is not being modified. Also consider marking it readonly.
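A sketch of what marking the set readonly looks like (IdFilter is an illustrative name); note that readonly only prevents reassigning the field, not mutating the set, so it must still be treated as read-only by convention after loading:

```csharp
using System.Collections.Generic;

static class IdFilter
{
    // readonly stops the field from being reassigned; the HashSet itself
    // can still be mutated, so only read from it after it is loaded.
    private static readonly HashSet<string> Blacklist =
        new HashSet<string> { "spam1", "spam2" };

    public static bool IsIDBlacklisted(string id) => Blacklist.Contains(id);
}
```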

Ehsan