
I'm working on high-performance code in which this construct is part of the performance-critical section.

This is what happens in some section:

  1. A string is 'scanned' and metadata is stored efficiently.
  2. Based on this metadata, chunks of the main string are separated into a char[][].
  3. That char[][] should be transferred into a string[].

Now, I know you can just call new string(char[]) but then the result would have to be copied.

To avoid this extra copy, I assume it must be possible to write directly to the string's internal buffer, even though this would be an unsafe operation (and I know this brings lots of implications, like overflow and forward compatibility).

I've seen several ways of achieving this, but none I'm really satisfied with.

Does anyone have concrete suggestions as to how to achieve this?

Extra information:
The actual process doesn't necessarily include converting to char[]; it's practically a 'multi-substring' operation, e.g. 3 (index, length) ranges of the main string appended together.

The StringBuilder has too much overhead for the small number of concats.

EDIT:
Due to some vague aspects of what it is exactly that I'm asking, let me reformulate it.

This is what happens:

  1. Main string is indexed.
  2. Parts of the main string are copied to a char[].
  3. The char[] is converted to a string.

What I'd like to do is merge step 2 and 3, resulting in:

  1. Main string is indexed.
  2. Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).

Note that I cannot change the output type from string[], since this is an external library and other projects depend on it (backward compatibility).
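
For reference, a minimal sketch of the two variants described above. `ViaCharArray` is the two-copy baseline from the question (parts copied into a `char[]`, then copied again by `new string(char[])`). `ViaStringCreate` is an assumption that goes beyond this thread: it relies on `string.Create`, which only exists on newer runtimes (.NET Core 2.1 and later) and was not available when the question was asked. The class name, method names, and the `starts`/`lengths` parameters are hypothetical.

```csharp
using System;

static class MultiSubstring
{
    // Baseline from the question: copy the parts into a char[] (copy 1),
    // then let the string constructor copy that array again (copy 2).
    public static string ViaCharArray(string source, int[] starts, int[] lengths)
    {
        int total = 0;
        foreach (int len in lengths) total += len;

        var buffer = new char[total];
        int offset = 0;
        for (int i = 0; i < starts.Length; i++)
        {
            source.CopyTo(starts[i], buffer, offset, lengths[i]);
            offset += lengths[i];
        }
        return new string(buffer);
    }

    // Assumption: .NET Core 2.1+ only. string.Create passes the callback a
    // span over the new string's own storage, so each part is copied once.
    public static string ViaStringCreate(string source, int[] starts, int[] lengths)
    {
        int total = 0;
        foreach (int len in lengths) total += len;

        return string.Create(total, (source, starts, lengths), (dest, state) =>
        {
            int offset = 0;
            for (int i = 0; i < state.starts.Length; i++)
            {
                state.source.AsSpan(state.starts[i], state.lengths[i])
                            .CopyTo(dest.Slice(offset));
                offset += state.lengths[i];
            }
        });
    }
}
```

Both methods return an ordinary `string`, so either one can fill a `string[]` without changing the library's public signature.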

Aidiakapi
  • What do you actually have to do with the strings after all this? That is, instead of trying to find ways to map to `string[]` without copying again, can you just bring it in as a `char[]` and then store `int,int` pairs of the position & length of the sub-parts you need, referencing the original array to pull out the substrings whenever you need them? – Jamie Treworgy Jan 11 '12 at 21:03
  • I'm not really sure what code you are trying to enhance here. – Andrew Barber Jan 11 '12 at 21:04
  • The string class is special; it is by definition immutable and involves copying. Trying to circumvent this is asking for trouble with the GC and other managed code (strings are pooled). – Nikki9696 Jan 11 '12 at 21:17
  • This is for a library, that is, the consumers of the library are getting strings as signature. So I can't change that. I know that strings are pooled, but out of reference sources I also know that a `StringBuilder` for example, internally holds a regular string, and mutates it. It doesn't make use of `char[]` to append. – Aidiakapi Jan 11 '12 at 22:06

4 Answers


I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.

For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction: all of those other strings 'inside' of it are now pointing at some arbitrary other memory, not the original string you carved them out of.

EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.
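
A minimal sketch of the pre-sized `StringBuilder` approach, assuming the parts are described by parallel `starts`/`lengths` arrays (hypothetical names, as is the helper class). It uses the `Append(string, int, int)` overload so no intermediate substrings are allocated:

```csharp
using System;
using System.Text;

static class PreSizedBuilder
{
    public static string Concat(string source, int[] starts, int[] lengths)
    {
        // Compute the exact final length up front so the builder
        // never has to grow its internal storage while appending.
        int total = 0;
        foreach (int len in lengths) total += len;

        var sb = new StringBuilder(total);
        for (int i = 0; i < starts.Length; i++)
        {
            // Appends source[starts[i] .. starts[i] + lengths[i]) directly.
            sb.Append(source, starts[i], lengths[i]);
        }
        return sb.ToString();
    }
}
```

Whether this is fast enough for only 2-4 parts is something a profiler has to answer.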

Chris Shain
  • I know the response is late. But I'm not trying to map the substrings as a part of the main string, I do want to copy them, but not copy them to a `char[]` and then to a `string`, I want to map them directly to a `string`. – Aidiakapi Jan 11 '12 at 22:09
  • There is no way that I know of to do that in the CLR. All String constructors, even the unsafe one that takes a pointer to a character array (http://msdn.microsoft.com/en-us/library/6y4za026.aspx), operate by copying the array. – Chris Shain Jan 11 '12 at 22:34
  • Ben Voigt already supplied one way that seems to do that so far. – Aidiakapi Jan 11 '12 at 22:36
  • Ben's code allocates a string and keys into the buffer behind it to modify the values. I was under the impression that wasn't what you are after- if it is, then go for it. – Chris Shain Jan 11 '12 at 22:39
  • As an aside, I **strongly** recommend that you read this: http://www.codinghorror.com/blog/2009/01/the-sad-tragedy-of-micro-optimization-theater.html – Chris Shain Jan 11 '12 at 22:39
  • In deference to that awesome article, and your other comment that most of these operations only involve a very small number of strings, are you sure that this is in fact the source of your bottleneck? I went through a similar exercise trying to optimize an HTML parser (even to the extent of starting to write the very same kind of unmanaged code you're looking for here) and saw a shockingly small improvement. After commenting out some of the other active parts of the code I realized that the string handling wasn't even on the radar for the bottleneck (it was object creation somewhere else). – Jamie Treworgy Jan 11 '12 at 22:46
  • I know that's true, but I still want to know the answer >.< – Aidiakapi Jan 11 '12 at 22:52
  • If you know most of the time there aren't that many concatenations, what about hardcoding a strategy that will assign the target array element directly for 0, 1, 2.. x (maybe up to 4) parts, and create a StringBuilder if >x? How this would be implemented depends a lot on the logic, but seems like you could cache the info for each piece in the early iterations of a loop (which you presumably have to build the targets) and if the loop terminates – Jamie Treworgy Jan 11 '12 at 23:04
  • That isn't going to work in many cases since the data is very variable. In fact, it's a very rare scenario that causes the same input to be used multiple times, and even then due to a different state the output can be different. Everything is variable, so hardcoded logic isn't going to get far. – Aidiakapi Jan 11 '12 at 23:10
  • I don't see why it matters what the data is. If you're trying to avoid duplicate byte-array copies (once to `char[]`, then again to `string` as a member of your return array) I'm just saying don't build each string automatically --- store pointers to each string in your loop, and then after the loop `if (iterations==1) { target[n] = original.Substring(start[0],length[0]); } else if (iterations==2) { target[n] = original.Substring(start[0],length[0]) + original.Substring(start[1],length[1]); }` ... instead of building an intermediate string, just build a list of indices and use them for small x. – Jamie Treworgy Jan 11 '12 at 23:14
  • Btw, I'm assuming that these intermediate strings must be of some significant length, otherwise I can't believe there's a lot to be gained. Of course, if the cost of adding an `int` to a list is just as high as the cost of creating a `char[]` of only a few characters, then it makes no difference. – Jamie Treworgy Jan 11 '12 at 23:21
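
The small-count strategy suggested in the comments above could be sketched roughly as follows. This is only an illustration under assumptions: the `Build` helper, its parameters, and the cut-off of two parts are hypothetical, and the right threshold would have to come from profiling.

```csharp
using System;
using System.Text;

static class SmallCountConcat
{
    // 'count' is the number of (start, length) pairs collected in the loop.
    public static string Build(string original, int[] start, int[] length, int count)
    {
        switch (count)
        {
            case 0:
                return string.Empty;
            case 1:
                // A single part needs no concatenation at all.
                return original.Substring(start[0], length[0]);
            case 2:
                return original.Substring(start[0], length[0])
                     + original.Substring(start[1], length[1]);
            default:
                // Fall back to a pre-sized StringBuilder for larger counts.
                int total = 0;
                for (int i = 0; i < count; i++) total += length[i];

                var sb = new StringBuilder(total);
                for (int i = 0; i < count; i++)
                    sb.Append(original, start[i], length[i]);
                return sb.ToString();
        }
    }
}
```

Note that the two-part case still allocates two temporary substrings before concatenating them, so it is not obviously cheaper than a single `char[]` copy.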

Just create your own addressing system instead of trying to use unsafe code to map to an internal data structure.

Mapping a string (which is also readable as a char[]) to an array of smaller strings is no different from building a list of address information (index & length of each substring). So make a new List<Tuple<int,int>> instead of a string[] and use that data to return the correct string from your original, unaltered data structure. This could easily be encapsulated into something that exposed string[].
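
A minimal sketch of such an addressing system, assuming a hypothetical `SegmentMap` wrapper around the original string. Substrings are only materialized when a caller asks for them, and `ToArray` shows how the result could still be exposed as `string[]`:

```csharp
using System;
using System.Collections.Generic;

sealed class SegmentMap
{
    private readonly string _source;
    // Each entry is (index, length) into _source.
    private readonly List<Tuple<int, int>> _segments = new List<Tuple<int, int>>();

    public SegmentMap(string source)
    {
        if (source == null) throw new ArgumentNullException("source");
        _source = source;
    }

    public int Count { get { return _segments.Count; } }

    public void Add(int index, int length)
    {
        if (index < 0 || length < 0 || index > _source.Length - length)
            throw new ArgumentOutOfRangeException();
        _segments.Add(Tuple.Create(index, length));
    }

    // Materializes one substring on demand (this is the single copy).
    public string this[int i]
    {
        get { return _source.Substring(_segments[i].Item1, _segments[i].Item2); }
    }

    // Materializes everything for callers that need a real string[].
    public string[] ToArray()
    {
        var result = new string[_segments.Count];
        for (int i = 0; i < result.Length; i++) result[i] = this[i];
        return result;
    }
}
```

The design trade-off is that repeated reads of the same segment copy it again each time, so this only pays off when most segments are never read or are read once.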

Jamie Treworgy
  • I'm sorry for not making clear that the return type couldn't be changed, because of dependencies. – Aidiakapi Jan 11 '12 at 22:07
  • Do you mean that this function must absolutely accept only a `string` and return only an actual instance of `string[]` (e.g. you can't return `IList`)? If it's for a library, I would think you'd favor a more general return type. – Jamie Treworgy Jan 11 '12 at 22:15
  • `Array` is more specific than `IList` and if the consumers would like to use it as an `IList` then they are free to do so, but I cannot assume that they do, for example if a consumer used it in Array.Copy their code would break. (And they'd have to refactor Length to Count etc.) – Aidiakapi Jan 11 '12 at 22:19
  • I think you're going to have to do some refactoring if you want to optimize this ;) even with unmanaged code I can't think of how it could work without making a copy of the string. You could theoretically create code that mapped another structure onto the string data in memory, but how would you ensure that the string never got gc'd if your library doesn't own it? Seems like many things could go wrong. p.s. i just saw your edit. Maybe you can post some code so we can see what you're trying to do. – Jamie Treworgy Jan 11 '12 at 22:33

What happens if you do:

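// Requires an unsafe context (compile with /unsafe).
// GetBuffer() is a placeholder: it must return a freshly allocated,
// non-interned string that is at least 6 characters long.
string s = GetBuffer();
fixed (char* pch = s) {   // pin s so the GC cannot move it while we write
    pch[0] = 'R';
    pch[1] = 'e';
    pch[2] = 's';
    pch[3] = 'u';
    pch[4] = 'l';
    pch[5] = 't';
}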

I think the world will come to an end (or at least the .NET managed portion of it), but that's very close to what StringBuilder does.

Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?

Ben Voigt
  • Assumption, because many times there won't even be concats, and mostly when there are, there'll only be 2-4 concats. We're not talking about huge numbers. Let me test the code sample you've supplied :). – Aidiakapi Jan 11 '12 at 22:22
  • I've profiled it now, resulting in (lower is better) 2720 for this method, 4291 with `char[]` and `new string(theArray)`, and finally 5165 for `StringBuilder`. – Aidiakapi Jan 11 '12 at 22:59
  • Do you know if this has side-effects? – Aidiakapi Jan 11 '12 at 23:29
  • @Aidiakapi: If `GetBuffer` creates a brand new "big enough" string every time, that isn't pooled (interned), then I think this is equivalent to how `StringBuilder` writes into a `string`. I was hoping that you would tell me if it has the desired side-effect (e.g. `s` actually does start with `"Result"` after leaving the `fixed` block). – Ben Voigt Jan 12 '12 at 00:06
  • It does actually give the correct result. I'm using this: `new string('\0', length)` as the `GetBuffer()`. I read in an article that this might cause weird side-effects when it comes to comparison and sorting. So I've created several tests and profiled them. The result is no real difference. – Aidiakapi Jan 12 '12 at 13:13
  • @Aidiakapi: Ah, yes, you probably shouldn't do anything that might call `GetHashCode()` until after changing the content, because caching the hash code would be a reasonable thing to do for a string with an immutability assumption. – Ben Voigt Jan 12 '12 at 16:27
  • Since it's internally within the method nothing can be called while mutating the internal buffer. I understand that people dislike unsafe code, but I think that for these purposes it's actually great. Since the strings are dynamically generated they'll never be interned anyway. (Without manually calling it ofc.) – Aidiakapi Jan 15 '12 at 00:42
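
Putting the answer and the comments above together, the approach could be sketched as follows. This is an illustration, not a recommendation: it mutates a `string`, which the runtime assumes to be immutable, so it is unsupported and may break on other runtime versions. It assumes the buffer comes from `new string('\0', length)` as the asker reports, that the new string is never interned or hashed before it is filled, and that the project is compiled with unsafe code enabled. `UnsafeConcat` and `CopyParts` are hypothetical names.

```csharp
using System;

static class UnsafeConcat
{
    public static unsafe string CopyParts(string source, int[] starts, int[] lengths)
    {
        // Validate every range first so the pointer writes below stay in bounds.
        int total = 0;
        for (int i = 0; i < starts.Length; i++)
        {
            if (starts[i] < 0 || lengths[i] < 0 || starts[i] > source.Length - lengths[i])
                throw new ArgumentOutOfRangeException();
            total += lengths[i];
        }

        // Never write into the shared empty string instance.
        if (total == 0) return string.Empty;

        // Fresh, non-interned string of the final length, used as the buffer.
        string result = new string('\0', total);

        fixed (char* src = source)
        fixed (char* dst = result)   // pinned: the GC cannot move either string here
        {
            char* write = dst;
            for (int i = 0; i < starts.Length; i++)
            {
                char* read = src + starts[i];
                for (int j = 0; j < lengths[i]; j++)
                    *write++ = read[j];
            }
        }
        return result;
    }
}
```

The string is only handed to callers after the `fixed` block ends, which matches the point made above that nothing can observe it while it is being mutated.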

In .NET, there is no way to create an instance of String which shares data with another string. Some discussion on why that is appears in this comment from Eric Lippert.

Sean U
  • He states that it isn't impossible; besides, I'm not trying to share data, I'm trying to copy once. – Aidiakapi Jan 11 '12 at 22:31
  • So are you just looking for `String.Substring()`? – Sean U Jan 11 '12 at 22:32
  • No >.<, like `"string1".Substring(x1, y1) + "string2".Substring(x2, y2) + "string3".Substring(x3, y3)` – Aidiakapi Jan 11 '12 at 22:36
  • Ah, I think I get it. A string ctor that accepts `IEnumerable<char>` would be really helpful there. I think others are right; `StringBuilder` is your best bet. – Sean U Jan 11 '12 at 22:45
  • A `char[]` is an `IEnumerable<char>`, but an `IEnumerable<char>` is not necessarily a `char[]`. – Sean U Jan 11 '12 at 23:04
  • Of course it is, but what'd be faster? A dynamically sized `IEnumerable<char>` or a fixed size `char[]`. Anyway, this is how it's currently implemented. So that's unrelated to the question. – Aidiakapi Jan 11 '12 at 23:13