3

I have a dataset. This dataset will serve a lookup table. Given a number, I should be able to lookup a corresponding value for that number.

The dataset (let's say its CSV) has a few caveats though. Instead of:

1,ABC
2,XYZ
3,LMN

The numbers are ranges (- being "through", not minus):

1-3,ABC     // 1, 2, and 3 = ABC
4-8,XYZ     // 4, 5, 6, 7, 8 = XYZ
11-11,LMN   // 11 = LMN

All the numbers are signed ints. No ranges overlap with another ranges. There are some gaps; there are ranges that aren't defined in the dataset (like 9 and 10 in the last snippet above). `

How might I model this dataset in C# so that I have the most-performant lookup while keeping my in-memory footprint low?

The only option I've come up with suffers from overconsumption of memory. Let's say my dataset is:

1-2,ABC
4-6,XYZ

Then I create a Dictionary<int,string>() whose key/values are:

1/ABC
2/ABC
4/XYZ
5/XYZ
6/XYZ

Now I have hash performance-lookup, but tons of wasted space in the hash table.

Any ideas? Maybe just use PLINQ instead and hope for good performance? ;)

SnickersAreMyFave
  • 5,047
  • 8
  • 26
  • 24
  • For large ranges you can use range compression (e.g. treat some ranges as blocks); additionally you can use arrays (which can be well-packed) as as bitmasks or indirect lookups (see arootbeer's answer). –  Oct 11 '10 at 22:29
  • So you just don't like the space used? If you are in fact just using numbers and strings I'm curious if .NET caches all the strings for you appropriately and you are only really wasting the space for all the pointers which might be unavoidable. Otherwise there is probably some kind of tree data structure that could be used if the double look-up doesn't do it. – Sean Copenhaver Oct 11 '10 at 22:38

5 Answers5

4

You can create a doubly-indirected lookup:

Dictionary<int, int> keys;
Dictionary<int, string> values;

Then store the data like this:

keys.Add(1, 1);
keys.Add(2, 1);
keys.Add(3, 1);
//...
keys.Add(11, 3);

values.Add(1, "ABC");
//...
values.Add(3, "LMN");

And then look the data up:

return values[keys[3]];  //returns "ABC"

I'm not sure how much memory footprint this will save with trivial strings, but once you get beyond "ABC" it should help.

EDIT

After Dan Tao's comment below, I went back and checked on what he was asking about. The following code:

var abc = "ABC";
var def = "ABC";
Console.WriteLine(ReferenceEquals(abc, def));

will write "True" to the console. Which means that the either the compiler or the runtime (clarification?) is maintaining the reference to "ABC", and assigns it as the value of both variables.

After reading up some more on Interned strings, if you're using string literals to populate the dictionary, or Interning computed strings, it will in fact take more space to implement my suggestion than the original dictionary would have taken. If you're not using Interned strings, then my solution should take less space.

FINAL EDIT

If you're treating your strings correctly, there should be no excess memory usage from the original Dictionary<int, string> because you can assign them to a variable and then assign that reference as the value (or, if you need to, because you can Intern them)

Just make sure your assignment code includes an intermediate variable assignment:

while (thereAreStringsLeftToAssign)
{
    var theString = theStringToAssign;
    foreach (var i in range)
    {
        strings.Add(i, theString);
    }
}
Matt Mills
  • 8,692
  • 6
  • 40
  • 64
  • +1 Very interesting, doubling the constant which still leaves it O(1) – Davy8 Oct 11 '10 at 22:34
  • Well actually you would get O(2) since you're doing a lookup in two hash tables, but yeah, nifty. – SnickersAreMyFave Oct 11 '10 at 22:38
  • 2
    @SnickersAreMyFave: The constant is removed out in Big-O, so it's still O(1). O(constant) is always O(1), no matter what the constant. – Reed Copsey Oct 11 '10 at 22:41
  • So how do you express that it takes twice as long, then? Would you say the complexity for this algorithm above is: O(1) + O(1)? – SnickersAreMyFave Oct 11 '10 at 22:47
  • How large do you think this data structure would need need to be to store the two range keys: {1,100000}, {100000,80000000}? This can rapidly become a more expensive options than just performing a binary-search solution. – LBushkin Oct 11 '10 at 22:48
  • @Snickers there is no such thing as O(1) + O(1) or O(2). There is no way to represent that something takes twice as long in Big-O notation. It doesn't measure things on that "small" of a scale. – Davy8 Oct 11 '10 at 23:02
  • If something takes 99n^2 +36n + 158 time, it has the same Big-O as 0.0000n^2. They're both O(n^2). – Davy8 Oct 11 '10 at 23:04
  • I must be missing something. How would this save any memory? It seems to me it would just cost more. – Dan Tao Oct 11 '10 at 23:22
  • It really only saves memory in the situation described by the question: each value in the original dictionary is a unique copy of the string referenced, rather than having each value contain a reference to the same string (as other answers have pointed out). – Matt Mills Oct 11 '10 at 23:35
  • @arootbeer: Where would these unique copies come from? I don't mean to be argumentative, but I really can't imagine *any* scenario where this would be the case... presumably, if the OP wants to define a value for a specified range of keys, he/she must *have* a `string` representing that value and can use this instance for each of the `int` keys within that range. If the string is known at compile-time, of course, it is interned; but even if it is only determined at run-time (e.g., if supplied by user input), it seems the OP would have to *deliberately* make a bunch of copies to waste memory. – Dan Tao Oct 12 '10 at 02:15
  • To propose a scenario, imagine that each string was produced by calling `ToString` on an object (using an override that doesn't `Intern`), rather than by assigning an intermediate variable. It's not something you *should* do, but it's something you easily *could* do. – Matt Mills Oct 12 '10 at 02:39
  • @arootbeer: I see what you mean. But this would essentially amount to a mistake on the developer's part. It seems (to me) that to use a `Dictionary` which merely provides a key with which to perform a lookup on a `Dictionary` would be quite a lot of effort simply to avoid creating duplicate string instances, which the developer could do far more easily by simply calling `ToString` *once* and using the resulting value for each key within a range. – Dan Tao Oct 12 '10 at 02:46
  • @Dan Tao: That's true. But if you desired to go to that effort, this seems to me to be a straightforward way to go about it :) – Matt Mills Oct 12 '10 at 02:48
4

If your dictionary is going to truly store a wide range of key values, an approach that expands all possible ranges into explicit keys will rapidly consume more memory than you likely have available.

You're best option is to use a data structure that supports some variation of binary search (or other O(log N) lookup technique). Here's a link to a generic RangeDictionary for .NET that uses an OrderedList internally, and has O(log N) performance.

Achieving constant-time O(1) lookup requires that you expand all ranges into explicit keys. This requires both a lot of memory, and can actually degrade performance when you need to split or insert a new range. This probably isn't what you want.

LBushkin
  • 129,300
  • 32
  • 216
  • 265
1

As arootbeer has mentioned in his answer, the following code does not create multiple instances of the string "ABC"; rather, it interns a single instance and assigns a reference to that instance to each KeyValuePair<int, string> in dictionary:

var dictionary = new Dictionary<int, string>();
dictionary[0] = "ABC";
dictionary[1] = "ABC";
dictionary[2] = "ABC";

// etc.

OK, so in the case of string literals, you're only using one string instance per range of keys. Is there a scenario where this wouldn't be the case--that is, where you would be using a separate string instance for each key within the range (this is what I assume you're concerned about when you speak of "overconsumption of memory")?

Honestly, I don't think so. There are scenarios where multiple equivalent string instances may be created without the benefit of interning, yes. But I can't imagine these scenarios would affect what you're trying to do here.

My reasoning is this: you want to assign certain values to different ranges of keys, right? So any time you are defining a key-range-value pairing of this sort, you have a single value and several keys. The single part is what leads me to doubt that you'll ever have multiple instances of the same string, unless it is defined as the value for more than one range.

To illustrate: yes, the following code will instantiate two identical strings:

string x = "ABC";

Console.Write("Type 'ABC' and press Enter: ");
string y = Console.ReadLine();

Console.WriteLine(Equals(x, y));
Console.WriteLine(ReferenceEquals(x, y));

The above program, assuming the user follows instructions and types "ABC," outputs True, then False. So you might think, "Ah, so when a string is only provided at run-time, it isn't interned! So this could be where my values could be duplicated!"

But... again: I don't think so. It all comes back to the fact that you are going to be assigning a single value to a range of keys. So let's say your values come from user input; then your code would look something like this:

var dictionary = new Dictionary<int, string>();

int start, count;
GetRange(out start, out count);
string value = GetValue();

foreach (int key in Enumerable.Range(start, count))
{
    // Look, you're using the same string instance to assign
    // to each key... how could it be otherwise?
    dictionary[key] = value;
}

Now, if you were actually thinking more along the lines of what LBushkin mentions in his answer--that you may potentially have huge ranges, making it impractical to define a KeyValuePair<int, string> for each key within that range (e.g., if you have a range of 1-1000000)--then I would agree that you're best off with some sort of data structure that bases its lookup on a binary search. If that's more your scenario, say so and I will be happy to offer more ideas on that front. (Or you could just take a look at the link LBushkin already posted.)

Community
  • 1
  • 1
Dan Tao
  • 125,917
  • 54
  • 300
  • 447
0

arootbeer has a good solution, but one you may find confusing to work with.

Another choice is to use a reference type instead of a string, so that you point to the same reference

class StringContainer { 
    public string Value { get; set; }
}

Dictionary<int, StringContainer> values;

var value1 = new StringContainer { Value = "ABC" };
values.Add(1, value1);
values.Add(2, value1);

They will both point to the same instance of StringContainer

EDIT: Thanks for the comments everyone. This method handles value types other than string, so it might be useful for more than the given example. Also, it is my understanding that strings don't always behave in the manner you would expect from reference values, but I could be wrong.

Luke Schafer
  • 9,209
  • 2
  • 28
  • 29
  • 3
    Strings _are_ reference types in C# / .NET – Wesley Hill Oct 11 '10 at 22:42
  • Yep, strings are actually references types. I had thought that the CLR actually caches and reuses string instances. Since they are immutable in .NET this is a safe thing to do. I can't find anything to confirm this though and don't have access to .NET on Windows right now. – Sean Copenhaver Oct 11 '10 at 22:51
  • 1
    You could also call `string.Intern` for the same effect without creating a new class. Also if the value of the string is known at compile time it is automatically interned. http://msdn.microsoft.com/en-us/library/system.string.intern.aspx – Davy8 Oct 11 '10 at 22:54
  • I think I read somewhere that strings are actually mutable in IL, but that functionality isn't used in any (popular?) .NET languages. I can't find the reference though unfortunately. – Davy8 Oct 11 '10 at 22:59
  • @Davy8 ah, thanks for the link. I'm pretty sure that's what I was thinking about. – Sean Copenhaver Oct 11 '10 at 23:08
  • 1
    @Davy8: Strings are mutable within mscorlib - that is, there are `internal` methods on the `String` class that mutate the current instance. Of course, there's nothing to stop you calling those methods from your own code using reflection, but it's inadvisable. – LukeH Oct 11 '10 at 23:21
0

Use a balanced ordered tree (or something similar) mapping start-of-range to end-of-range and data. This will be easy to implement for non-overlapping ranges.

Rafe
  • 5,237
  • 3
  • 23
  • 26