Why does .NET create new substrings instead of pointing into existing strings?

Question

From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.

My underlying goal was to create an IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.

I haven't thought about it very hard, or looked at StringBuilder's implementation with Reflector, but would an IEnumerable Split(this StringBuilder, Char) method work? — Domenic, Jul 04 '09 at 16:53
If String.Substring() didn't allocate new memory, strings wouldn't be immutable — Felipe Pessoto, Jul 06 '09 at 14:42

score 24 · Accepted Answer · answered Jul 04 '09 at 16:29

One reason most languages with immutable strings create new substrings rather than pointing into existing strings is that sharing would interfere with garbage collecting those strings later.

What happens if a string is used only for its substring, and the larger string then becomes unreachable (except through the substring)? The larger string cannot be collected, because that would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
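
The retention effect described above can be observed with `ReadOnlyMemory<char>`, a view type that was added to .NET well after this discussion. This is only a sketch: the class and method names (`SharedBufferDemo`, `FirstFive`, `BackingLength`) are made up for illustration, and it shows the trade-off rather than how `String.Substring` works.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SharedBufferDemo
{
    // Returns a 5-character view over a string without copying any characters.
    public static ReadOnlyMemory<char> FirstFive(string large)
    {
        return large.AsMemory(0, 5);
    }

    // Returns the length of the string that the view points into,
    // or -1 if the view is not backed by a string.
    public static int BackingLength(ReadOnlyMemory<char> view)
    {
        return MemoryMarshal.TryGetString(view, out var text, out _, out _)
            ? text.Length
            : -1;
    }
}
```

A 5-character view taken from a million-character string still reports a backing length of one million: as long as the view is reachable, so is the whole original string.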

answered Jul 04 '09 at 16:29

SingleNegationElimination


I thought the main reason was in regards to algorithms over the strings. If you can safely assume that a string will never change you can pass references to it safely and it's also inherently threadsafe. I guess that ties in with garbage collection too. – Spence Jul 04 '09 at 16:39
@Spence - that is a reason for immutability. It's not a reason for avoiding shared buffers between strings. Once you have immutability and GC, you can easily implement shared buffers behind the scenes without breaking thread safety or existing algorithms. – Daniel Earwicker Jul 05 '09 at 09:07

score 2 · Answer 2 · answered Jul 04 '09 at 15:49

This is not possible with the String class unless you poke around inside .NET. You would have to pass around references to a mutable array and make sure no one modified it.

.NET creates a new string every time you ask for one. The only exception is interned strings, which the compiler creates for literals (and which you can create yourself): each is placed in memory once, and every use refers to that single instance, for memory and performance reasons.
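
A small sketch of the difference, using `String.Intern` and `Object.ReferenceEquals`. The class and method names are illustrative only, and the first method reflects how current runtimes behave rather than a documented guarantee.

```csharp
using System;

public static class InternDemo
{
    // Two identical Substring calls produce two separate string objects.
    public static bool SubstringsAreDistinctObjects()
    {
        string source = "hello world";
        string a = source.Substring(0, 5);
        string b = source.Substring(0, 5);
        return !object.ReferenceEquals(a, b);
    }

    // Interning maps equal strings to one shared instance.
    public static bool InternedAreSameObject()
    {
        string source = "hello world";
        string a = string.Intern(source.Substring(0, 5));
        string b = string.Intern(source.Substring(0, 5));
        return object.ReferenceEquals(a, b);
    }
}
```

Note that interning still requires the substring to be allocated first; it deduplicates the result and does not avoid the copy.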

score 1 · Answer 3 · answered Jul 04 '09 at 16:08

Each string has to have its own string data, given the way the String class is implemented.

You can make your own SubString structure that uses part of a string:

using System;
using System.Text;

public struct SubString {

   private readonly string _str;
   private readonly int _offset, _len;

   public SubString(string str, int offset, int len) {
      _str = str;
      _offset = offset;
      _len = len;
   }

   public int Length { get { return _len; } }

   // Index is relative to the start of the substring
   public char this[int index] {
      get {
         if (index < 0 || index >= _len) throw new IndexOutOfRangeException();
         return _str[_offset + index];
      }
   }

   // Appends the characters without creating an intermediate string
   public void WriteToStringBuilder(StringBuilder s) {
      s.Append(_str, _offset, _len);
   }

   // Allocates a new string only when one is actually needed
   public override string ToString() {
      return _str.Substring(_offset, _len);
   }

}

You can flesh it out with other methods, such as comparison, that can also be implemented without extracting the string.
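
Building on that idea, the Split method from the question might look roughly like the sketch below. `StringSegment` is a hypothetical stand-in for the struct above (redefined so the example is self-contained), and `SplitSegments` is an invented name. The iterator still allocates one enumerator object, so this avoids the per-piece string copies rather than all allocation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical view over part of a string; no characters are copied.
public readonly struct StringSegment
{
    public StringSegment(string source, int offset, int length)
    {
        Source = source;
        Offset = offset;
        Length = length;
    }

    public string Source { get; }
    public int Offset { get; }
    public int Length { get; }

    // Copies the characters only when a real string is requested.
    public override string ToString() => Source.Substring(Offset, Length);
}

public static class SegmentExtensions
{
    // Lazily yields one segment per separator-delimited piece.
    public static IEnumerable<StringSegment> SplitSegments(this string source, char separator)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterate(source, separator);
    }

    private static IEnumerable<StringSegment> Iterate(string source, char separator)
    {
        int start = 0;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i] == separator)
            {
                yield return new StringSegment(source, start, i - start);
                start = i + 1;
            }
        }
        // The final piece runs to the end of the string (and may be empty).
        yield return new StringSegment(source, start, source.Length - start);
    }
}
```

Every segment keeps a reference to the original string, so the garbage-collection trade-off discussed in the accepted answer applies here as well.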

Yes, it's easy for the SubString structure to create another that is part of itself. — Guffa, Jul 05 '09 at 11:59

score 0 · Answer 4 · answered Jul 04 '09 at 15:55

Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.

In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?

Of course, nothing prevents the .NET BCL team from changing this behavior in future versions of .NET. It wouldn't have any impact on existing code.

edited Jul 04 '09 at 16:20

answered Jul 04 '09 at 15:55

Philippe Leybaert


Java's String actually does it that way: Substrings are merely pointers into the original string. However, that also means that when you take a 200-character substring of a 200-MiB string, the 200-MiB string will always lie around in memory as long as the small substring isn't garbage-collected. – Joey Jul 04 '09 at 16:00
I think it could impact existing code given that it is designed around this behaviour. If people assume that interning their string will stop it from being duplicated and this behaviour was stopped it could cause working apps to stop with out of memory exceptions. – Spence Jul 04 '09 at 16:32
How can you design around this behavior? Because of the immutability of strings, there's really no way to create code that would break if the internal implementation of the string class changes. – Philippe Leybaert Jul 04 '09 at 16:36
.Net string operations indeed create new string objects, but it's not *because* strings are immutable. In fact, it's because strings are immutable that string operations *could* reuse current string objects instead of creating new ones. – Rob Kennedy Jul 04 '09 at 16:39
If C# used this approach, it wouldn't make garbage collection any different. The original string would have multiple references to it, and so it would not be garbage collected until all substrings based on it were also unreachable. Hence what Joey says. Java has faster substring, potentially much higher memory use, and C# has slow substring, potentially much more efficient memory use. – Niall Connaughton Sep 05 '15 at 08:37

score 0 · Answer 5 · answered Jul 04 '09 at 19:41

Adding to the point that Strings are immutable, you should be aware that the following snippet will generate multiple String instances in memory.

String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;

s1+s2 => new string instance (temp1)

temp1 + s3 => new string instance (temp2)

res is a reference to temp2.

answered Jul 04 '09 at 19:41

Babak Naffas


This sounds like something that the compiler folks could optimize. – Ian Boyd Jul 04 '09 at 20:16
It's not an issue with the compiler, it's a choice made in designing the language. Java has the same rules for Strings. System.Text.StringBuilder is a good class to use that simulates the "mutable" strings. – Babak Naffas Jul 04 '09 at 20:26
Wrong - s1 + s2 + s3 gets turned into a single call to String.Concat. This is why it is NOT better to use String.Format or StringBuilder (which are both comparatively slow), for up to 4 strings. Look at the IL to see what the compiler does, and use a profiler to find out what performs well in your program. Otherwise you might as well be saying "Look, it is a shoe! He has removed his shoe and this is a sign that others who would follow him should do likewise!" Please post factual answers instead of mythical ones. – Daniel Earwicker Jul 05 '09 at 09:03
i.e. Ian Boyd's comment is right (except that the compiler folks already took care of it in version 1.) – Daniel Earwicker Jul 05 '09 at 09:04
As per the C# Language Reference, the + operator on a string is defined as: string operator +(string x, string y); string operator +(string x, object y); string operator +(object x, string y); While the implementation of the operator may use the Concat method, it doesn't change the fact that + is a binary operator; hence, s1 + s2 + s3 would be the equivalent of String.Concat( String.Concat( s1, s2), s3) with a new string object returned for each call to Concat() – Babak Naffas Jul 06 '09 at 18:24
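
For reference, the two forms being argued about can be written side by side. This sketch (class and method names are illustrative) only shows that they produce equal strings. The claim that the compiler emits the single three-argument `String.Concat` call for `s1 + s2 + s3` is Daniel Earwicker's, and it is best confirmed by inspecting the IL of your own build.

```csharp
using System;

public static class ConcatDemo
{
    // The expression as written in the answer.
    public static string WithOperator(string s1, string s2, string s3)
    {
        return s1 + s2 + s3;
    }

    // The single three-argument call described in the comments,
    // which builds the result without a temporary string.
    public static string WithConcat(string s1, string s2, string s3)
    {
        return string.Concat(s1, s2, s3);
    }
}
```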
