0

I have a 200MiB binary file containing pairs of Int64 and Int32. The file is already sorted by the values on the Int64.

I'm reading it like this:

private SortedDictionary<Int64, int> edgeID = new SortedDictionary<Int64, int>();

//Carico il file del mapping arco ID
using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    Int64 keyID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();

    for (int i = 0; i < numArchi; i++)
    {
        keyID = (b.ReadInt64());
        grafoID = b.ReadInt32();
        edgeID[keyID] = grafoID;
    }
}

But it's super slow. I noticed if I was using a normal Dictionary I could speed up a bit if I pass the dimension of the Dictionary in the costructor but apparently SortedDictionary doesn't support that. Also I think the bigger problem is the program doesn't know I'm passing data already ordered so it checks the key at every insert.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Federico
  • 789
  • 1
  • 6
  • 29
  • 4
    Why load it via a sorted dictionary then? Use a normal dictionary (with appropriate dimensions.. as it stops re-sizing dynamically) and transfer it to a SortedDictionary afterwards.. ? Do you even need the functionality of a SortedDictionary if your data is loaded _already_ sorted? – Simon Whitehead Jan 16 '14 at 10:55
  • @SimonWhitehead I assumed accessing a SortedDictionary by Key is faster than a norma Dictionary. Is this true? – Federico Jan 16 '14 at 10:56
  • 1
    SortedDictionary is very slow compared to a normal Dictionary – harold Jan 16 '14 at 10:57
  • @Federico No. A `SortedDictionary` is a binary tree. A `Dictionary` is a hash table. `HashTable` access is _O(1)_.. binary tree is _O(log n)_. – Simon Whitehead Jan 16 '14 at 11:05

2 Answers2

3

Your second assumption is partially correct: SortedDictionary, in general, has no way of telling that data is ordered and needs to check they are inserted in the right place.

Furthermore, SortedDictionary is a red-black tree internally in the current Mono and Microsoft implementations of the BCL.

Insertion of already ordered data is actually the worst case scenario for a red black tree (see insert section here)

Also, the implementation might change in the future, so optimizing for it/agaist it is a risky move!

In comparison, Dictionary is an hash-table internally (that's why you can give it an initial size; hash-tables are implemented using arrays underneath (of different shapes, depending on the implementation)) and so both insertion and look-up times are faster (amortized, O(1)).

You have to consider: what do you need it for? If you want to just search by key, and the data is already ordered, probably the best way of doing it is loading the data in a simple contiguous array (a .NET array or a list) and use binary search on it. You will need the overload that takes a comparer, as your array will probably hold a Tuple - or you may just use two arrays, one of "indexes" and one of "values".. your choice.

Much faster than SortedDictionary and Dictionary on loading (and should be faster than SortedDictionary in searching too).

It might be that Dictionary is still faster on searching a single item; hash-tables have 0(1) access, while binary search on an ordered array is 0(log N), but locality plays an important role on this "intermediate" data dimensions, so the real result might be different from the theoretical one. You will need to profile and measure to decide which is best in your case!

(By intermediate dimensions I mean: something for which the big-O notation does not play a dominant role, yet)

Community
  • 1
  • 1
Lorenzo Dematté
  • 7,638
  • 3
  • 37
  • 77
  • 1
    Good explanation; however, your last point might be incorrect. Binary search incurs O(log₂n) cost, whilst a hash-based dictionary can perform lookups in O(1) time. – Douglas Jan 16 '14 at 11:10
  • Correct! I was just editing my answer to include this (important) detail. – Lorenzo Dematté Jan 16 '14 at 11:14
0

Using a Dictionary with a max number of elements like this:

using (BinaryReader b = new BinaryReader(File.Open(@"filed", FileMode.Open)))
{
    Int64 tomtomID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();
    edgeID = new Dictionary<long, int>(numArchi);

    for (int i = 0; i < numArchi; i++)
    {
        tomtomID = (b.ReadInt64());
        grafoID = b.ReadInt32();
        edgeID[keyID] = grafoID;
    }
}

This solved the issue.

Abbas
  • 14,186
  • 6
  • 41
  • 72
Federico
  • 789
  • 1
  • 6
  • 29