19

I have an application that keeps ~1,000,000 strings in memory for performance reasons. The application consumes ~200 MB of RAM.

I want to reduce the amount of memory consumed by the strings.

I know .NET represents strings in UTF-16 encoding (2 bytes per char). Most strings in my application contain only English characters, so storing them in UTF-8 would take about half the space of UTF-16.
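To make the size claim concrete, here is a minimal sketch that compares the character payload of a string in both encodings. The `StringPayload` class and its method names are hypothetical, and per-object overhead (object header, length field) is deliberately left out.

```csharp
using System.Text;

// Hypothetical helper for comparing payload sizes; not part of the BCL.
public static class StringPayload
{
    // Bytes used by the character data of a .NET string (UTF-16, 2 bytes per char).
    public static int Utf16Bytes(string s)
    {
        return s.Length * sizeof(char);
    }

    // Bytes the same text needs when encoded as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

For pure ASCII text such as file names, the UTF-8 payload is half the UTF-16 payload. For text made up mostly of CJK characters it is larger, since those take 3 bytes each in UTF-8 and 2 in UTF-16.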

Is there a way to store a string in memory in UTF-8 encoding while still allowing standard string functions? (My needs are mostly IndexOf with StringComparison.OrdinalIgnoreCase.)

DxCK
  • What about using a byte array or `List`? Not sure how much more difficult it would be to work with these objects for your needs though. – Jason Down Mar 09 '12 at 18:55
  • @DxCK, You "want" or "need"? The difference is important to provide either interesting answers or practical ones. – Alexei Levenkov Mar 09 '12 at 18:56
  • 3
    Do you absolutely have to load all 1,000,000 strings into memory? Can you provide more details on what exactly are you doing with all these strings in memory? – Dean Kuga Mar 09 '12 at 18:58
  • I don't know anywhere near enough about it to write an actual answer, but if you turn string interning off (there's an assembly setting you can use) that should stop the CLR caching strings. This will reduce memory if your strings aren't very long lived, although it'll be even worse if you have loads of the same string. – RichK Mar 09 '12 at 18:59
  • 7
    Why is 200MB a problem, do you have a problem with low memory conditions or out of memory conditions? – Lasse V. Karlsen Mar 09 '12 at 19:06
  • @Alexei Levenkov I want and need. – DxCK Mar 09 '12 at 19:07
  • @Dean K. Yes, it is in-memory for performance reasons. – DxCK Mar 09 '12 at 19:07
  • @Lasse V. Karlsen♦ Because it is a desktop application. My users are very satisfied with the performance but not with the memory footprint, so I am trying to improve it. – DxCK Mar 09 '12 at 19:12
  • 6
    But again, is 200MB a problem? Do your users have little memory available? Note that I'm not saying 200MB is acceptable either, it depends on the application, but typically, when people "complain" about memory usage and applications, they don't consider that they have all that memory available for exactly one reason: making applications run fast! – Lasse V. Karlsen Mar 09 '12 at 19:13
  • 2
    I like to liken memory usage to garage space usage. If you have a big garage with place for about 10 cars, why are you quibbling about a square foot on a bench in the corner? – Lasse V. Karlsen Mar 09 '12 at 19:15
  • 1
    @LasseV.Karlsen - If you take care of the cents, the dollars will take care of themselves. – daniloquio Mar 09 '12 at 19:20
  • 1
    @Lasse V. Karlsen♦ My users include, but are not limited to, home users. – DxCK Mar 09 '12 at 19:23
  • @DxCK in such a situation I would use an in-memory-DB and SQL for the string searches etc. - is that an option for you ? – Yahia Mar 09 '12 at 19:24
  • If the users will not need to install components on their machine, then yes, this is an option. It can affect the whole application's functionality and performance, so I would need to make a POC of it. Can you recommend such a DB? – DxCK Mar 09 '12 at 19:28
  • @DxCK It's not clear whether your build configuration is x86 or AnyCPU. If you targeted x86 rather than 64-bit you could reduce memory usage a fair bit. – cspolton Mar 09 '12 at 19:30
  • @DxCK Also, have you considered compressing the strings greater than 1 KB in memory? I read an article mentioning that Stack Overflow compresses the contents of its caches. See http://www.hanselman.com/blog/TheWeeklySourceCode35ZipCompressingASPNETSessionAndCacheState.aspx – cspolton Mar 09 '12 at 19:39
  • @Spolto Thanks for the comment. For now, I am targeting AnyCPU because I want to run on any Windows machine (x86 or x64) and get the x64 benefits in performance and memory scale when available. Targeting x86 will reduce the memory footprint for sure, but the strings still remain the dominant memory consumers. – DxCK Mar 09 '12 at 19:43
  • 200mb strings sounds like a lot of characters. Is this for log files or large XML files? (If so, then customer should probably be running the app on a meaty desktop or server) – Chris S Mar 09 '12 at 19:45
  • I see people trying to guess what exactly my application does... so if you want you can download and see by yourself: http://www.master-seeker.com – DxCK Mar 09 '12 at 19:51
  • I hate to be the bearer of bad news DxCK, but everything already does this using the NTFS index in Windows :) It's still an interesting question though. – Chris S Mar 09 '12 at 19:53
  • @Chris S I know Everything. But there are some differences. To see some of them, just try to search for kernel32.dll in both applications (not in parallel!); my app will give more results. Also, my app supports FAT32, displays the size of folders, and can sort quickly by size and date/time. Anyway, I don't know if this is the right place to discuss it. – DxCK Mar 09 '12 at 20:08
  • Are all the strings unique values? – Mike Mar 09 '12 at 21:01
  • @Mike Most of them are unique, but good point. I will also try some deduplication. Thanks! – DxCK Mar 09 '12 at 21:02
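The deduplication idea from the last two comments could look roughly like the sketch below. `StringPool` and `GetOrAdd` are hypothetical names, and whether this saves memory depends on how many values actually repeat, because the dictionary itself also consumes memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: hands out one shared instance per distinct string value.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Returns the pooled instance equal to value, adding value on first sight.
    public string GetOrAdd(string value)
    {
        if (value == null) return null;

        string existing;
        if (_pool.TryGetValue(value, out existing)) return existing;

        _pool.Add(value, value);
        return value;
    }

    // Number of distinct values seen so far.
    public int Count
    {
        get { return _pool.Count; }
    }
}
```

Every string would be passed through `GetOrAdd` while loading, so duplicates end up referencing the same object and the extra copies become eligible for garbage collection.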

5 Answers

13

Unfortunately, you can't change the .NET internal representation of string. My guess is that the CLR is optimized for multibyte strings.

What you are dealing with is the famous paradigm of the Space-time tradeoff, which states that in order to gain memory you'll have to use more processor, or you can save processor by using some memory.

That said, take a look at some considerations here. If I were you, once I had established that the memory gain would be worth it, I would try writing my own "string" class that uses ASCII encoding. This will probably suffice.

UPDATE:

More on the money, you should check this post, "Of memory and strings", by StackOverflow legend Jon Skeet, which deals with the problem you are facing. Sorry I didn't mention it right away; it took me some time to find the exact post from Jon.

Bruno Brant
4

Is there a way to store a string in memory in UTF-8 encoding while allowing standard string functions? (My needs including mostly IndexOf with StringComparison.OrdinalIgnoreCase).

You could store them as byte arrays and provide your own IndexOf implementation (since converting back to string for IndexOf would likely be a huge performance hit). Use the System.Text.Encoding functions for that (the best bet would be a build step that converts to bytes, and then reading the byte arrays from disk, only converting back to string for display, if needed).
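A custom IndexOf over UTF-8 byte arrays might look like the sketch below. The names `Utf8Search` and `IndexOfIgnoreAsciiCase` are hypothetical, and it rests on one loud assumption: only ASCII letters are case-folded, so non-ASCII characters match case-sensitively, which is not identical to `StringComparison.OrdinalIgnoreCase`.

```csharp
using System;

// Hypothetical helper: naive substring search over UTF-8 encoded byte arrays.
public static class Utf8Search
{
    // Folds ASCII 'A'..'Z' to lowercase; every other byte is returned unchanged.
    private static byte FoldAscii(byte b)
    {
        return (b >= (byte)'A' && b <= (byte)'Z') ? (byte)(b + 32) : b;
    }

    // Returns the BYTE offset of the first match of needle in haystack, or -1.
    // Assumption: only ASCII letters are compared case-insensitively.
    public static int IndexOfIgnoreAsciiCase(byte[] haystack, byte[] needle)
    {
        if (haystack == null) throw new ArgumentNullException("haystack");
        if (needle == null) throw new ArgumentNullException("needle");
        if (needle.Length == 0) return 0;

        int last = haystack.Length - needle.Length;
        for (int start = 0; start <= last; start++)
        {
            int i = 0;
            while (i < needle.Length &&
                   FoldAscii(haystack[start + i]) == FoldAscii(needle[i]))
            {
                i++;
            }
            if (i == needle.Length) return start;
        }
        return -1;
    }
}
```

The result is a byte offset, which equals the character index only when everything before the match is ASCII. Because UTF-8 continuation bytes never look like lead bytes or ASCII bytes, a match of a valid UTF-8 needle always starts on a character boundary.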

You could store them in a C/C++ library, letting you use single byte strings. You probably wouldn't want to marshal them back, but you could possibly just marshal results (I assume there's some sort of searching going on here) without too much of a perf hit. C++/CLI may make this easier (by being able to write the searching code in C++/CLI, but the string "database" in C++).

Or, you could revisit the initial performance issues that require all of the strings to be in memory. An embedded database, indexing, etc. may both speed things up and reduce memory usage, and be more maintainable.

Mark Brackett
  • How can one implement an IgnoreCase compare of characters? Is there any UTF-8 library/representation available in C/C++? – DxCK Mar 09 '12 at 19:48
  • @DxCK the problem you have is if you limit yourself to 8 bits, you don't support a large portion of languages used in the world, even with C++ and UTF8 – Chris S Mar 09 '12 at 19:59
  • 1
    @Chris S How does UTF8 encoding limit the language? – DxCK Mar 09 '12 at 20:02
  • I suggest creating a simple lookup table with the case conversion precomputed - since you will be encoding to 8 bits, you would require a 256 entry table and you could convert simply by doing a lookup (e.g `byte lowChar = _lowcaseTable[upperChar];`) –  Mar 09 '12 at 20:09
  • @DxCK - that sounds like another SO question. ;) Boost has a string library, there's various Windows APIs, etc. If you go the byte array route, you'd need to provide your own, I think. I'm afraid that goes beyond my limited knowledge of Unicode - though I suppose you could do worse than do the easy ASCII compare for pure ASCII sequences, and defer to the BCL for complicated Unicode compares. – Mark Brackett Mar 09 '12 at 20:34
  • 1
    @sgorozco - I think you're confusing UTF-8 (which is Unicode) with plain ol' ASCII. UTF-8 stores the ASCII characters as single byte, but is variable-width to store the rest of Unicode. – Mark Brackett Mar 09 '12 at 20:40
  • In other words, UTF8 is great for English and European languages, but limiting your strings to 8 bits for memory optimizations will mean Hindi and Chinese aren't supported. – Chris S Mar 10 '12 at 00:09
  • @ChrisS: Again, UTF8 doesn't *limit* to 8-bits. It just optimizes the English characters (using 1 byte) at the expense of a few 3-byte sequences (which UTF16 doesn't use). Otherwise, it's fairly similar to UTF16. – Mark Brackett Mar 10 '12 at 17:04
  • @MarkBrackett I know that, I think the point I was trying to make to DxCK was you're probably going to have to return to a `string` at some point unless you check the encoding and do what you mention, write your own indexof – Chris S Mar 13 '12 at 11:31
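The two suggestions in this comment thread (a precomputed 256-entry case table, and an ASCII fast path that defers to the BCL otherwise) can be sketched as follows. `AsciiFold`, `ToLower` and `IsAscii` are hypothetical names. As an assumption, the table folds only ASCII 'A'..'Z', since bytes from 0x80 upward are parts of multi-byte UTF-8 sequences and cannot be case-mapped one byte at a time.

```csharp
// Hypothetical helper: table-driven ASCII case folding for UTF-8 bytes.
public static class AsciiFold
{
    // 256-entry table: 'A'..'Z' map to 'a'..'z', every other byte maps to itself.
    private static readonly byte[] LowerTable = BuildTable();

    private static byte[] BuildTable()
    {
        var table = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            table[i] = (i >= 'A' && i <= 'Z') ? (byte)(i + 32) : (byte)i;
        }
        return table;
    }

    // Case-folds a single byte with one array lookup.
    public static byte ToLower(byte b)
    {
        return LowerTable[b];
    }

    // True when every byte is below 0x80, i.e. the UTF-8 data is pure ASCII
    // and the byte-wise fast path gives the same answer as a string compare.
    public static bool IsAscii(byte[] utf8)
    {
        foreach (byte b in utf8)
        {
            if (b >= 0x80) return false;
        }
        return true;
    }
}
```

A caller could run the byte-wise comparison when `IsAscii` is true for both operands and otherwise decode to `string` and use the BCL comparison, accepting the conversion cost only for the non-ASCII minority.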
2

What if you store it as a byte array? Just restore it to a string when you need to do some operations on it. I'd make a class for setting and getting the strings which internally stores them as byte arrays.

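To a byte array:

```csharp
// Encode the UTF-16 string as UTF-8 bytes.
string s = "whatever";
byte[] b = System.Text.Encoding.UTF8.GetBytes(s);
```

Back to a string:

```csharp
// Decode the UTF-8 bytes into a new string (this allocates).
string restored = System.Text.Encoding.UTF8.GetString(b);
```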
SpoBo
  • 1
    I tried that. Converting back to String has a significant performance cost: allocating memory, converting from UTF-8 to UTF-16, then garbage-collecting the result. For 1,000,000 strings the cost is very noticeable. – DxCK Mar 09 '12 at 19:30
  • @DxCK "then GC it" - what do you mean by that? – H H Mar 09 '12 at 19:37
  • well what do you want ... performance or a smaller footprint? :) Does your app continuously need every single string? If not perhaps only store off strings that haven't been used in a while. Make a class that does some sort of internal 'memory collecting' instead of garbage collecting. – SpoBo Mar 09 '12 at 19:45
  • I'm guessing a byte array is no good, as he needs to search the strings – Chris S Mar 09 '12 at 19:56
  • Well, you could use byte arrays and get good performance if you rewrite the String class with your preferred char encoding. Yay, remember data structures. – Patrick Lorio Mar 09 '12 at 20:08
  • جب یہ اردو ہے کیا ہوتا ہے؟ (What happens when it's Urdu?) – Chris S Mar 10 '12 at 00:12
2

Try using an in-memory DB as "storage" and SQL to interact with the data. For example, SQLite can be deployed as part of your application (it consists of just 1-2 DLLs which can be placed in the same folder as your application).

Yahia
0

What if you create your own UTF-8 string class (UTF8String?) and supply an implicit cast to String? You'll be sacrificing some speed for the sake of memory, but that might be what you're looking for.
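A minimal sketch of such a class is shown below. `Utf8String` is a hypothetical type, not something the BCL provides, and it only covers storage and conversion.

```csharp
using System;
using System.Text;

// Hypothetical type: keeps the text as UTF-8 bytes and converts on demand.
public sealed class Utf8String
{
    private readonly byte[] _bytes;

    public Utf8String(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        _bytes = Encoding.UTF8.GetBytes(value);
    }

    // Size of the stored UTF-8 payload in bytes.
    public int ByteLength
    {
        get { return _bytes.Length; }
    }

    // Decodes to a new UTF-16 string; this allocates on every call.
    public override string ToString()
    {
        return Encoding.UTF8.GetString(_bytes);
    }

    // ReferenceEquals is used on purpose: "value == null" would bind to
    // string equality through this very conversion and recurse forever.
    public static implicit operator string(Utf8String value)
    {
        return object.ReferenceEquals(value, null) ? null : value.ToString();
    }

    public static implicit operator Utf8String(string value)
    {
        return value == null ? null : new Utf8String(value);
    }
}
```

With this in place, `Utf8String name = "kernel32.dll";` and `string text = name;` both compile. Each conversion back to `string` allocates and decodes, so searching would still need to run directly on the bytes to avoid that cost.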

itsme86
  • I tried that. Converting back to String has a significant performance cost: converting from UTF-8 to UTF-16, then garbage-collecting the result. For 1,000,000 strings the cost is very noticeable. – DxCK Mar 09 '12 at 19:04