I have a Dictionary<string,T> where the string is the key of a record. For each record in the dictionary I also need to maintain two other pieces of information: its category and its redundancy (how many times it is repeated).

For example, the record XYZ1 is of category 1 and is repeated once, so the entry has to look something like this:

"XYZ1", {1,1}

Later on, I may encounter the same record again in my dataset, so the value for that key has to be updated like this:

"XYZ1", {1,2}
"XYZ1", {1,3}
...

Since I am processing a large number of records (around 100K), I tried this approach, but it seems inefficient: fetching the value from the dictionary, slicing {1,1}, and then converting both slices to integers adds a lot of overhead to the execution.

I was thinking of using binary digits to represent both category and repetition, and maybe a bitmask to extract each piece.

Edit: I tried using an object with 2 properties, and then Tuple<int,int>. Complexity got worse!

My question: is it possible to do so?

If not (in terms of complexity), any suggestions?

  • Yes, it's possible. Not sure how much it's going to buy you, though. Why don't you make T (your value type above) a class/struct that has the two properties you need? – James R. Apr 01 '16 at 20:24
  • I tried using an object with 2 properties, and then Tuple. Complexity got worse! – hassan alrehamy Apr 01 '16 at 20:32

2 Answers

It seems like category never changes. So rather than using a simple string for the key of your dictionary, I would instead do something like:

Dictionary<Tuple<string,int>,int> where the key of the dictionary is a Tuple<string,int> where the string is the record and the int is the category. Then the value in the dictionary is just a count.

A dictionary is probably going to be the fastest data structure for what you're trying to accomplish, as it has near constant time, O(1), lookup and insertion.

You can speed it up a little bit by using the Tuple, as now the category is part of the key and no longer a bit of information you have to access separately.
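A minimal sketch of the composite-key idea, assuming each input record arrives as a (record, category) pair. The class name CompositeKeyCounter and the method Count are made up for illustration and are not part of the question.

```csharp
using System;
using System.Collections.Generic;

public static class CompositeKeyCounter
{
    // Counts how many times each (record, category) pair occurs.
    // Tuple<string,int> compares by value, so equal pairs map to the same entry.
    public static Dictionary<Tuple<string, int>, int> Count(
        IEnumerable<Tuple<string, int>> records)
    {
        var counts = new Dictionary<Tuple<string, int>, int>();
        foreach (var key in records)
        {
            int current;
            // TryGetValue sets current to 0 when the key is not present yet.
            counts.TryGetValue(key, out current);
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

Note that this only behaves like the original layout if a record always has the same category; the same string with two different categories would produce two separate entries.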

At the same time you could also keep the string as the key and store a Tuple<int,int> as the value and simply set Item1 as the category and Item2 as the count.
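The string-key variant could look roughly like the sketch below. Since System.Tuple is immutable, the count cannot be incremented in place, so each update stores a new tuple. The names TupleValueCounter and Record are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TupleValueCounter
{
    // Item1 holds the category, Item2 holds the occurrence count.
    public static void Record(
        Dictionary<string, Tuple<int, int>> data, string key, int category)
    {
        Tuple<int, int> existing;
        if (data.TryGetValue(key, out existing))
        {
            // Tuple is immutable: replace the entry with an incremented copy.
            data[key] = Tuple.Create(existing.Item1, existing.Item2 + 1);
        }
        else
        {
            // First time this record is seen.
            data[key] = Tuple.Create(category, 1);
        }
    }
}
```

Each repeat therefore allocates a new small object, which is one possible reason this variant may not beat a mutable class as the value.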

Either way is going to be roughly equivalent in speed. Processing 100k records in such a manner should be pretty fast either way.

Ayo I

What is your type T? You could define a custom type which holds the information you need (category and occurrences).

class MyInfo {
  public int c { get; set; } 
  public int o { get; set; }
}

Dictionary<String, MyInfo> data;

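Then, when traversing your data, you can easily check whether a key is already present. If it is, just increment the occurrences; otherwise insert a new element.

MyInfo d;
foreach (var e in elements) {
    // TryGetValue does the lookup once and returns the stored instance in d
    if (!data.TryGetValue(e.key, out d))
        data.Add(e.key, new MyInfo { c = e.cat, o = 1 });
    else
        d.o++;  // MyInfo is a class, so this updates the entry in place
}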

EDIT

You could also combine the category and the number of occurrences into one UInt64. For instance, put the category in the upper 32 bits (i.e. you can have about 4 billion categories) and the number of occurrences in the lower 32 bits (i.e. each key can occur about 4 billion times).

Dictionary<string, UInt64> data;

UInt64 d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        // Cast before shifting: on a 32-bit int, << 32 leaves the value unchanged
        data[e.key] = ((UInt64)e.cat << 32) + 1;
    else
        data[e.key] = d + 1;
}

And if you want to get the number of occurrences for one specific key you can just inspect the respective part of the value.

var d = data["somekey"];
var occurrences = d & 0xFFFFFFFF;  
var category = d >> 32;  
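The packing and unpacking can also be wrapped in small helpers so the shifts and masks live in one place. This is only a sketch: the class PackedRecord and its method names are invented here, and it assumes both category and count fit in 32 bits.

```csharp
using System;

public static class PackedRecord
{
    // Category goes into the upper 32 bits, count into the lower 32 bits.
    public static ulong Pack(uint category, uint count)
    {
        return ((ulong)category << 32) | count;
    }

    // Upper 32 bits.
    public static uint Category(ulong packed)
    {
        return (uint)(packed >> 32);
    }

    // Lower 32 bits.
    public static uint Count(ulong packed)
    {
        return (uint)(packed & 0xFFFFFFFFUL);
    }
}
```

Adding 1 to the packed value increments the count, as in the loop above. If a count ever went past the 32-bit maximum, the carry would spill into the category bits, so this layout relies on counts staying below that limit.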
derpirscher