
I am working on software for scientific research that deals heavily with chemical formulas. I keep track of the contents of a chemical formula using an internal Dictionary<Isotope, int> where Isotope is an object like "Carbon-13", "Nitrogen-14", and the int represents the number of those isotopes in the chemical formula. So the formula C2H3NO would exist like this:

{ "C12": 2,
  "H1":  3,
  "N14": 1,
  "O16": 1 }

This is all fine and dandy, but when I want to add two chemical formulas together, I end up having to calculate the hash of `Isotope` twice to update a value; see the following code example.

public class ChemicalFormula {
    internal Dictionary<Isotope, int> _isotopes = new Dictionary<Isotope, int>();
    private bool _isDirty;

    public void Add(Isotope isotope, int count)
    {
        if (count != 0)
        {
            int curValue = 0;
            if (_isotopes.TryGetValue(isotope, out curValue))
            {
                int newValue = curValue + count;
                if (newValue == 0)
                {
                    _isotopes.Remove(isotope);
                }
                else
                {
                    _isotopes[isotope] = newValue;
                }
            }
            else
            {
                _isotopes.Add(isotope, count);
            }
            _isDirty = true;
        }
    }
}

While this may not seem like much of a slowdown, it is when we are adding billions of chemical formulas together: this method is consistently the slowest part of the program (>45% of the running time). I am dealing with large chemical formulas like "H5921C3759N1023O1201S21" that are consistently being added to by smaller chemical formulas.

My question is, is there a better data structure for storing data like this? I have tried creating a simple IsotopeCount object that contains an int, so I can access the value through a reference type (as opposed to a value type) to avoid the double hash lookup. However, this didn't seem beneficial.
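For reference, the wrapper idea I tried looks roughly like this (a minimal sketch, not my real classes; I use a string key here as a stand-in for `Isotope` so the snippet is self-contained):

```csharp
using System.Collections.Generic;

public class IsotopeCount
{
    public int Count;
}

public class ChemicalFormula
{
    // One hash lookup fetches the mutable holder; the count is then
    // updated in place with no second lookup (except on removal).
    internal Dictionary<string, IsotopeCount> _isotopes =
        new Dictionary<string, IsotopeCount>();

    public void Add(string isotope, int count)
    {
        IsotopeCount holder;
        if (_isotopes.TryGetValue(isotope, out holder))
        {
            holder.Count += count;           // in-place update on the reference type
            if (holder.Count == 0)
            {
                _isotopes.Remove(isotope);   // second lookup only when the count hits zero
            }
        }
        else if (count != 0)
        {
            _isotopes.Add(isotope, new IsotopeCount { Count = count });
        }
    }
}
```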

EDIT: `Isotope` is immutable and shouldn't change during the lifetime of the program, so I should be able to cache the hash code.

I have linked to the source code so you can see the classes more in depth rather than me copy and paste them here.

Moop

4 Answers


I have tried creating a simple IsotopeCount object that contains a int so I can access the value in a reference-type (as opposed to value-type) to avoid the double hash function. However, this didn't seem beneficial.

Well it would stop the double hashing... but obviously it's then worse in terms of space. What performance difference did you notice?

Another option you should strongly consider if you're doing this a lot is caching the hash within the Isotope class, assuming it's immutable. (If it's not, then using it as a dictionary key is at least somewhat worrying.)

If you're likely to use most Isotope values as dictionary keys (or candidates) then it's probably worth computing the hash during initialization. Otherwise, pick a particularly unlikely hash value (in an ideal world, that would be any value) and use that as the "uncached" value, and compute it lazily.
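A sketch of the eager-caching version, assuming `Isotope`'s identity is its element symbol plus mass number (your real fields may differ):

```csharp
public sealed class Isotope
{
    private readonly string _symbol;    // e.g. "C"
    private readonly int _massNumber;   // e.g. 13
    private readonly int _hash;         // computed once, up front

    public Isotope(string symbol, int massNumber)
    {
        _symbol = symbol;
        _massNumber = massNumber;
        // Safe to cache because the fields are readonly (the type is immutable).
        _hash = _symbol.GetHashCode() * 31 + _massNumber;
    }

    public override int GetHashCode()
    {
        return _hash;   // no recomputation on every dictionary operation
    }

    public override bool Equals(object obj)
    {
        Isotope other = obj as Isotope;
        // Compare the cheap int first; fall through to the string only if needed.
        return other != null
            && other._massNumber == _massNumber
            && other._symbol == _symbol;
    }
}
```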

If you've got 45% of the running time in GetHashCode, have you looked at optimizing that? Is it actually GetHashCode, or Equals which is the problem? (You talk about "hashing" but I suspect you mean "hash lookup in general".)

If you could post the relevant bits of the Isotope type, we may be able to help more.

EDIT: Another option to consider if you're using .NET 4 would be ConcurrentDictionary, with its AddOrUpdate method. You'd use it like this:

public void Add(Isotope isotope, int count)
{
    // I prefer early exit to lots of nesting :)
    if (count == 0)
    {
        return;
    }

    int newCount = _isotopes.AddOrUpdate(isotope, count, 
                                         (key, oldCount) => oldCount + count);
    if (newCount == 0)
    {
        int removed;
        _isotopes.TryRemove(isotope, out removed);
    }
    _isDirty = true;
}
Jon Skeet
  • Thanks for the tips and thoughts (early exits I agree with!). I get slightly better performance using the `IsotopeCount` wrapper (~5-7%), but at the cost of memory and added complexity, so I didn't think it was worth it. I now calculate my own hash at the creation of `Isotope`s and that has helped. I also overrode the `.Equals()` method to simplify and speed up equality comparisons. I tried `ConcurrentDictionary` a while back, but that took a big performance hit (I assume due to the extra locking and thread-safety). – Moop Jul 31 '12 at 18:34

Do you actually require random access to an Isotope count by type, or are you using the dictionary only as a means of associating a key with a value?

I would guess the latter.

My suggestion to you is not to work with a dictionary but with a sorted array (or List) of IsotopeTuples, something like:

class IsotopeTuple {
    Isotope i;
    int count;
}

sorted by the name of the isotope.

Why the sorting?

Because then, when you want to "add" two formulas together, you can do it in linear time by traversing both arrays (I hope this is clear; I can elaborate if needed). No hash computation required, just super-fast comparisons of order.

This is a classic approach when dealing with vector multiplications where the dimensions are words. Used widely in text mining.

The tradeoff is of course that the construction of the initial vector is O(n log n), but I doubt you will feel the impact.
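To make the linear-time traversal concrete, here is a sketch of the merge step (a stand-in `IsotopeTuple` keyed by a string so the snippet is self-contained; your real comparer would compare `Isotope` objects):

```csharp
using System.Collections.Generic;

class IsotopeTuple
{
    public readonly string Isotope;  // stand-in key for a real Isotope object
    public readonly int Count;
    public IsotopeTuple(string isotope, int count) { Isotope = isotope; Count = count; }
}

static class FormulaMath
{
    // Classic merge step over two lists already sorted by isotope:
    // O(n + m), no hashing, just order comparisons.
    public static List<IsotopeTuple> Merge(List<IsotopeTuple> a, List<IsotopeTuple> b)
    {
        var result = new List<IsotopeTuple>(a.Count + b.Count);
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            int cmp = string.CompareOrdinal(a[i].Isotope, b[j].Isotope);
            if (cmp < 0) result.Add(a[i++]);
            else if (cmp > 0) result.Add(b[j++]);
            else
            {
                int sum = a[i].Count + b[j].Count;
                if (sum != 0) result.Add(new IsotopeTuple(a[i].Isotope, sum));
                i++; j++;
            }
        }
        while (i < a.Count) result.Add(a[i++]);
        while (j < b.Count) result.Add(b[j++]);
        return result;
    }
}
```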

Vitaliy
  • I do have some methods that use random-access to `Isotope` but I could probably rewrite those. I could just make `Isotope : IComparable` to handle the sorted list. I might try this and get back. – Moop Jul 30 '12 at 20:51
  • I initially tried a `SortedList` and traversing the two arrays, but ran into problems modifying the Value collection while in a nested loop, so that didn't work. I moved to the standard (non-generic) `SortedList`, added the needed casts, and it works, but with performance no better than the `Dictionary` method. Lastly, I tried a `List` like you suggested, but was forced to constantly call `list.Sort()` to keep it sorted, and that caused performance issues as well. Thanks for the suggestions though – Moop Jul 31 '12 at 18:29
  • Why do you have to constantly add things to the list? Can you elaborate more on the use case? How did you implement the comparer? – Vitaliy Jul 31 '12 at 20:02
  • Another suggestion I have for you is in the area of memory conservation. You say you have BILLIONS of isotopes; that is a lot of strings. The thing is, if I understand correctly, the strings repeat themselves more often than not (it's the count that differs). I would suggest you use string interning (http://msdn.microsoft.com/en-us/library/system.string.intern.aspx) so as not to instantiate multiple instances of the same string. You may be surprised by the amount of memory it can save you. – Vitaliy Jul 31 '12 at 20:16
  • You have to add new `Isotope` to a chemical formula in a lot of situations (Adding C2H3NO to H2O requires adding both Carbon and Nitrogen to the H2O formula). The comparer just calls a `double.CompareTo()` between the two isotope masses. For your second comment, I do have billions of chemical formulas, but I use a singleton pattern for the isotopes, so there is only one object in existence, so memory of strings is not an issue. – Moop Aug 01 '12 at 00:36
  • You can improve the compare even further by returning the difference between the doubles you are comparing. About the construction of the formulas, remember that you can always do the computation in parallel. Moreover, if the isotopes are initially stored in sorted order, the addition can take linear time if implemented like the 'Merge' step of the 'Merge-Sort' algorithm, so the sorting needs to be done only at construction time of the initial formulas. – Vitaliy Aug 01 '12 at 10:33

I second the opinion that Isotope should be made immutable with precalculated hash. That would make everything much simpler.

(In fact, functionally-oriented programming is better suited for calculations of this sort, since it deals in immutable objects.)

Sergei Rogovtcev
  • I agree with you and @jonskeet about caching the hashcode for `Isotope` and that has helped performance slightly. But I think the slowdown is internal to the Dictionary and am looking for ways to avoid repeated calls to access the same spot in memory. – Moop Jul 30 '12 at 21:02
  • If you're optimizing for performance (and are not concerned about memory usage), you can assign each isotope a unique `int` index, which will be used as, well, an index into an array of `int` counts. You'll be using much more memory in every formula, but array access will be faster. – Sergei Rogovtcev Jul 30 '12 at 21:09
  • This is an interesting idea. As other users point out, I am going to be heavily focused on a handful of `Isotope`s that I use all the time (i.e. C, N, H, O, S, P) and could give those small unique `int` values (i.e. 0 - 20). All other isotopes I could assign larger unique `int`s, and if a user adds an unusual element, 'Hf' for example, I would simply resize the internal `int[]` array to accommodate it. This should save memory and improve performance. – Moop Jul 31 '12 at 18:37
  • I accepted your answer because both you and @jonskeet suggested caching the hashcode to speed things up, but I really like your comment about using a different data structure. I implemented a method where the most common isotopes are stored at the front of the array, resizing the array if I have to. This basically worked as my own implementation of `Dictionary` and sped everything up >40%. – Moop Aug 06 '12 at 19:35
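A rough sketch of that array-backed approach (indices, the initial size of 21, and the resize policy are illustrative, not the actual implementation):

```csharp
using System;

public class ArrayFormula
{
    // Counts indexed by a per-isotope unique index assigned at creation;
    // the common isotopes (C, H, N, O, S, P variants) occupy the low slots.
    private int[] _counts = new int[21];

    public void Add(int isotopeIndex, int count)
    {
        if (isotopeIndex >= _counts.Length)
            Array.Resize(ref _counts, isotopeIndex + 1); // rare isotope: grow once
        _counts[isotopeIndex] += count;
    }

    // Adding a whole formula is a straight array walk: no hashing at all.
    public void Add(ArrayFormula other)
    {
        if (other._counts.Length > _counts.Length)
            Array.Resize(ref _counts, other._counts.Length);
        for (int k = 0; k < other._counts.Length; k++)
            _counts[k] += other._counts[k];
    }

    public int Count(int isotopeIndex)
    {
        return isotopeIndex < _counts.Length ? _counts[isotopeIndex] : 0;
    }
}
```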

Another solution that you could think of if you had a limited number of Isotopes and no memory problems:

public struct Formula
{
   public int C12;
   public int H1;
   public int N14;
   public int O16;
}

I am guessing you're looking at organic chemistry, so you may not have to deal with that many isotopes, and if the lookup is the issue, this one will be pretty fast...
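Adding two such formulas is then just field-wise addition of the struct above, something like:

```csharp
// Field-wise addition of two fixed-layout formulas; no lookups at all.
public static Formula Add(Formula a, Formula b)
{
    Formula result;
    result.C12 = a.C12 + b.C12;
    result.H1  = a.H1  + b.H1;
    result.N14 = a.N14 + b.N14;
    result.O16 = a.O16 + b.O16;
    return result;
}
```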

edeboursetty
  • I wrote the code to handle all the elements (>100) and isotopes (>1,000) so that would be almost impossible as you described, especially the .Add method. But I agree, for a very simple formula that would be fast and efficient. – Moop Jul 30 '12 at 20:57