
I have the following code:

    List<HashSet<String>> authorLists = new List<HashSet<String>>();
    // fill it

    /** Remove duplicate authors */
    private void removeDuplicateAuthors(HashSet<String> newAuthors, int curLevel)
    {
        for (int i = curLevel - 1; i > 0; --i)
        {
            HashSet<String> authors = authorLists[i];
            foreach (String item in newAuthors)
            {
                if (authors.Contains(item))
                {
                    // problem: the set is modified while foreach is still
                    // enumerating it, which throws InvalidOperationException
                    newAuthors.Remove(item);
                }
            }
        }
    }

How do I remove items correctly? I need to iterate through newAuthors and authorLists at the same time, which is why RemoveWhere cannot be used here.

It is very inefficient to create a new list, add items to it, and then remove the duplicate items. In my case, the sets in authorLists have the following sizes:

    authorLists[0].size = 0;
    authorLists[1].size = 322;
    authorLists[2].size = 75000; // (even more than this value)

I need to call removeDuplicateAuthors on the order of 1 (1) × 322 (n) × 75000 (m) times, where n and m are the numbers of duplicate authors on the 1st and 2nd levels respectively. I have to delete these items very often, and the sets are very large, so this algorithm is very inefficient. Actually, I have the following code in Java and need to rewrite it for some reasons:

    /** Remove duplicate authors in the tree of authors */
    private void removeDuplicateAuthors(HashSet<String> newCoauthors, int curLevel) {
        for (int i = curLevel - 1; i > 0; --i) {
            HashSet<String> authors = coauthorLevels.get(i);
            for (Iterator<String> iter = newCoauthors.iterator(); iter.hasNext();) {
                String item = iter.next();
                if (authors.contains(item)) {
                    // the iterator allows safe removal during iteration
                    iter.remove();
                }
            }
        }
    }

It works much faster than the options suggested so far.

– user565447

4 Answers

3

You can add the items you want to remove to another hash set and then remove them all afterwards.
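
A minimal sketch of that idea, assuming two sets named newAuthors and authors as in the question (the sample contents are made up for illustration):

    var newAuthors = new HashSet<string> { "alice", "bob", "carol" };
    var authors = new HashSet<string> { "bob", "carol" };

    // First pass: buffer the duplicates in a separate set
    // instead of removing them mid-enumeration
    var toRemove = new HashSet<string>();
    foreach (string item in newAuthors)
    {
        if (authors.Contains(item))
        {
            toRemove.Add(item);
        }
    }

    // Second pass: newAuthors is no longer being enumerated,
    // so removing is safe now
    foreach (string item in toRemove)
    {
        newAuthors.Remove(item);
    }
    // newAuthors now contains only "alice"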

– Ehsan
1

What you are doing here is wrong for two reasons:

1. you cannot alter a set you are iterating over (a syntax problem);
2. even if you make your code work, you will only alter the value, not the reference (a logic problem).

    List<HashSet<String>> authorLists = new List<HashSet<String>>();
    // fill it

    /** Remove duplicate authors */
    // handle the reference instead of the value
    private void removeDuplicateAuthors(ref HashSet<String> newAuthors, int curLevel)
    {
        List<string> removeAuthors = new List<string>();

        for (int i = curLevel - 1; i > 0; --i)
        {
            HashSet<String> authors = authorLists[i];
            foreach (String item in newAuthors)
            {
                if (authors.Contains(item))
                {
                    removeAuthors.Add(item);
                }
            }
        }

        // remove the collected duplicates after the enumeration has finished
        foreach (string author in removeAuthors)
        {
            newAuthors.Remove(author);
        }
    }
– Sergiu Mindras
0

What you're looking for is ExceptWith. You want one set subtracted from another, which is exactly what that method computes.
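
Applied to the question's method, a minimal sketch (authorLists and the loop bounds are copied from the question):

    private void removeDuplicateAuthors(HashSet<String> newAuthors, int curLevel)
    {
        for (int i = curLevel - 1; i > 0; --i)
        {
            // Removes, in place, every element of authorLists[i] from newAuthors.
            // There is no manual loop over newAuthors, so nothing is
            // modified mid-enumeration.
            newAuthors.ExceptWith(authorLists[i]);
        }
    }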

– Servy
  • no, I do not need an intersection, I need to ensure that hashset A doesn't include any items in hashset B – user565447 Mar 18 '14 at 18:26
  • @user565447 Right, didn't look closely enough; you want `ExceptWith`. Either way, the method is already there to do exactly this. – Servy Mar 18 '14 at 18:31
-2

Forgive me if I don't understand what you are trying to do.

Hash sets don't allow duplicates because the index of an item is the hash of the item. Two equal strings would have the same hash, and therefore the same index. Therefore if you simply combine any two hash sets, the result is free from duplicates.

Consider the following:

        var set1 = new HashSet<string>();
        set1.Add("foo");
        set1.Add("foo");

        var set2 = new HashSet<string>();
        set2.Add("foo");

        var set3 = set1.Union(set2);

        foreach (var val in set3)
        {
          Console.WriteLine(val);   
        }

The output of this code would be:

foo

Now if you are trying to ensure that hashset A doesn't include any items in hashset B, you could do something like this:

        var set1 = new HashSet<string>();
        set1.Add("foo");
        set1.Add("bar");

        var set2 = new HashSet<string>();
        set2.Add("foo");
        set2.Add("baz");

        foreach (var val in set2)
        {
            set1.Remove(val);
        }

        foreach (var val in set1)
        {
            Console.WriteLine(val);    
        }

The output of which would be:

bar

Giving this some more thought, you can subtract one set from another using the .Except method.

var set3 = set1.Except(set2);

This produces all the items in set1 that are not in set2.
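
One caveat worth noting (an addition, not from the answer itself): LINQ's Except returns a new, lazily evaluated sequence and leaves the original set untouched, while HashSet<T>.ExceptWith mutates the set in place. A small sketch of the difference, assuming System.Linq is in scope:

    var a = new HashSet<string> { "foo", "bar" };
    var b = new HashSet<string> { "foo", "baz" };

    // Except (LINQ): 'a' is left unchanged; the result is a new sequence
    IEnumerable<string> difference = a.Except(b);   // yields "bar"

    // ExceptWith: 'a' itself is modified in place
    a.ExceptWith(b);                                // a is now { "bar" }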

– William Leader
    Your intro paragraph is very wrong. Hashes are not unique. It's possible for unequal objects to have the same hash value. This is called a collision, and any hash based data structure needs to deal with that correctly (they do so through a defined `Equals` method). Duplicates aren't allowed because the collection is logically a set, which doesn't allow duplicates. A "bag" is an unordered collection that allows duplicates. You can most certainly create an implementation of a hash based bag structure, if you wanted to. It's not particularly hard either. – Servy Mar 18 '14 at 18:34
  • @Servy, while you are technically correct, you are getting into the implementation details that the asker need not worry about. – William Leader Mar 18 '14 at 18:39
  • @WilliamLeader If they are implementation details that need not be worried about, then why are you emphasizing them in your answer? – Servy Mar 18 '14 at 18:40