13

I am currently working on a very large legacy application which handles a large amount of string data gathered from various sources (e.g. names, identifiers, common codes relating to the business). This data alone can take up to 200 MB of RAM in the application process.

A colleague of mine mentioned one possible strategy to reduce the memory footprint (as a lot of the individual strings are duplicated across the data sets): "cache" the recurring strings in a dictionary and reuse them when required. So for example…

public class StringCacher
{
    private readonly Dictionary<string, string> _stringCache;

    public StringCacher()
    {
        _stringCache = new Dictionary<string, string>();
    }

    // Returns the cached instance of an equal string, adding it the first time it is seen.
    public string AddOrReuse(string stringToCache)
    {
        if (!_stringCache.ContainsKey(stringToCache))
            _stringCache[stringToCache] = stringToCache;

        return _stringCache[stringToCache];
    }
}

Then to use this caching...

public IEnumerable<string> IncomingData()
{
    var stringCache = new StringCacher();

    var dataList = new List<string>();

    // Add the data, a fair amount of the strings will be the same.
    dataList.Add(stringCache.AddOrReuse("AAAA"));
    dataList.Add(stringCache.AddOrReuse("BBBB"));
    dataList.Add(stringCache.AddOrReuse("AAAA"));
    dataList.Add(stringCache.AddOrReuse("CCCC"));
    dataList.Add(stringCache.AddOrReuse("AAAA"));

    return dataList;
}

As strings are immutable, and the framework does a lot of internal work to make them behave like value types, I half suspect that this will just copy each string into the dictionary and double the amount of memory used, rather than pass around a reference to the string stored in the dictionary (which is what my colleague is assuming).

So taking into account that this will be run on a massive set of string data...

  • Is this going to save any memory, assuming that 30% of the string values will be used twice or more?

  • Is the assumption that this will even work correct?

Martin Cooper
  • 4
    This is a mistake, 30% is not nearly enough to justify making your program a hundred times slower. RAM is cheap and plentiful, 8 gigabytes costs 67 bucks. You can't write a line of code for $1.64 – Hans Passant May 19 '13 at 16:47
  • 1
    +1 to @HansPassant for working out the time vs. RAM ROI. – Andy Brown May 19 '13 at 17:24
  • 1
    @HansPassant Thanks for pointing this out. I'll make sure I do performance testing when implementing. I agree that memory in your average PC is dirt cheap these days, but unfortunately, on production workstations in a large financial institution, all memory (and any other part) has to be purchased and installed through a specific provider, which pushes the real cost of 8 GB to over 500 bucks per workstation. Multiply this by 1000+ users and you can see why machine upgrades are not really an option. – Martin Cooper May 20 '13 at 17:10

3 Answers

13

This is essentially what string interning is, except you don't have to worry how it works. In your example you are still creating a string, then comparing it, then leaving the copy to be disposed of. .NET will do this for you in runtime.

See also String.Intern and Optimizing C# String Performance (C Calvert)

If a new string is created with code like (String goober1 = "foo"; String goober2 = "foo";) shown in lines 18 and 19, then the intern table is checked. If your string is already in there, then both variables will point at the same block of memory maintained by the intern table.
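A minimal sketch of that behaviour, and of where it stops. The class and method names here are made up for illustration. Identical literals share one instance, while a string built at run time with the same contents is a separate object until it is passed to `string.Intern` explicitly.

```csharp
using System;

// Hypothetical demo class; not part of any library.
public static class LiteralInterningDemo
{
    // Builds "foo" at run time, so the result is a fresh object rather than the literal.
    public static string BuildAtRuntime()
    {
        return new string(new[] { 'f', 'o', 'o' });
    }

    public static void Run()
    {
        string goober1 = "foo";
        string goober2 = "foo";
        string built = BuildAtRuntime();

        // Identical literals resolve to the same interned instance.
        Console.WriteLine(object.ReferenceEquals(goober1, goober2));              // True

        // Equal contents, but a different object on the heap.
        Console.WriteLine(goober1 == built);                                      // True
        Console.WriteLine(object.ReferenceEquals(goober1, built));                // False

        // string.Intern returns the instance already held in the intern pool.
        Console.WriteLine(object.ReferenceEquals(goober1, string.Intern(built))); // True
    }
}
```

Calling `LiteralInterningDemo.Run()` from an entry point prints the four results. Data read from files, databases or the network behaves like `built` here, not like the literals.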

So, you don't have to roll your own - it won't really provide any advantage. EDIT UNLESS: your strings don't usually live for as long as your AppDomain - interned strings live for the lifetime of the AppDomain, which is not necessarily great for GC. If you want short lived strings, then you want a pool. From String.Intern:

If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely to be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates. ...

EDIT 2: Also see Jon Skeet's SO answer here
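For the short-lived case, a pool along the lines of the question's `StringCacher` might look like the sketch below. The `StringPool` name and its members are hypothetical, and the class is not thread-safe. A `Dictionary<string, string>` stores references, so the key and the value are the same object and no character data is copied. `Clear()` drops the pool's references so the strings can be collected once nothing else uses them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical clearable alternative to string.Intern; not thread-safe.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Returns the pooled instance with the same contents, adding 'value' if it is new.
    public string GetOrAdd(string value)
    {
        if (value == null) return null;

        string existing;
        if (_pool.TryGetValue(value, out existing))
            return existing;        // reuse; the caller's duplicate can be collected

        _pool.Add(value, value);    // key and value are the same reference
        return value;
    }

    public int Count
    {
        get { return _pool.Count; }
    }

    // Releases the pool's references once the data set is no longer needed.
    public void Clear()
    {
        _pool.Clear();
    }
}
```

With `var pool = new StringPool();`, two calls such as `pool.GetOrAdd(new string('A', 4))` return the same object, which `object.ReferenceEquals` confirms. `TryGetValue` does a single lookup on a hit, instead of the `ContainsKey` call followed by an indexer read. Whether the saving outweighs the hashing cost depends on the data, so it is worth measuring.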

Andy Brown
  • A good set of the data probably won't be around for the lifetime of the application, so maybe in this case, it would be more efficient to store them in a dictionary which I can clear when the data sets are no longer required. – Martin Cooper May 19 '13 at 16:35
  • It sounds sensible. String interning is perfect for literals and constants defined in code, for localisation strings that take up significant space and can benefit from "deduplication", for [CMS](https://en.wikipedia.org/wiki/Content_management_system) style apps that keep strings in memory. But if you are, for example, pulling down html from a web server, processing sections from it and then throwing them all away then you may be better off with your deduplication pool. – Andy Brown May 19 '13 at 16:58
  • @Moog, also note: `_stringCache[stringToCache] = stringToCache;` as you have it written may well duplicate that string (once for the key, once for the value), I'm not sure as I'm running out the door - but worth checking. – Andy Brown May 19 '13 at 17:01
  • 1
    @Moog. Nope, just checked the BCL code - _from what I can work out_ you are ok. `Dictionary` does no funky stuff other than also working out and storing the string hashcode and using this to speed up comparisons (so for lengthy strings, that could actually be a benefit). – Andy Brown May 19 '13 at 17:22
  • _".NET will do this for you in runtime"_ -- no, it will not. The example you cite involves string _literals_, which are handled at compile time. No interning of strings is done at runtime, unless _explicitly_ done by the user code, by calling the `string.Intern()` method. – Peter Duniho Feb 28 '21 at 18:12
3

This is already built into .NET. It's called String.Intern, so there is no need to reinvent it.

oleksii
  • OK, great, I wasn't aware of that! So would using this method to cache strings have a noticeable effect on the memory footprint? Would calling String.Intern on hundreds of thousands of strings have much impact on performance? – Martin Cooper May 19 '13 at 15:46
  • It should decrease memory consumption and improve performance. You need to test it to be able to see the impact on your application. – oleksii May 19 '13 at 15:48
  • 2
    @Moog. Careful though - interned strings live for the lifetime of the AppDomain, so they aren't GC'd. If you want short-lived strings your pool idea might be better (see my comment in my answer) – Andy Brown May 19 '13 at 16:00
3

You can achieve this using the built-in .NET functionality.

When you initialise your string, make a call to string.Intern() with your string.

For example:

dataList.Add(string.Intern("AAAA"));

Every subsequent call with the same string will use the same reference in memory. So if you have 1000 AAAAs, only 1 copy of AAAA is stored in memory.
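As a rough sketch of how this applies to data that arrives at run time: the `InternOnLoad` class and its methods are hypothetical names, and the `new string('A', 4)` calls only simulate values parsed from an external source, each of which starts out as its own object.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; shows string.Intern applied to run-time data.
public static class InternOnLoad
{
    // Interns each incoming value so equal strings share one instance.
    // Note: string.Intern throws ArgumentNullException for null input.
    public static List<string> Load(IEnumerable<string> incoming)
    {
        var dataList = new List<string>();
        foreach (string value in incoming)
            dataList.Add(string.Intern(value));
        return dataList;
    }

    // Returns true when the two "AAAA" entries end up as the same object.
    public static bool Demo()
    {
        // Simulated input: three separately allocated strings, two with equal contents.
        var incoming = new[] { new string('A', 4), new string('B', 4), new string('A', 4) };

        List<string> data = Load(incoming);
        return object.ReferenceEquals(data[0], data[2]);
    }
}
```

`InternOnLoad.Demo()` returns `true`: after interning, the first and third entries refer to one object. The original duplicates become garbage, while the interned copies stay alive for as long as the runtime keeps its intern pool.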

Rob Aston