This is a spin-off from the discussion in some other question.

Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.

The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.
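For reference, the baseline being described can be sketched roughly as follows. The class name `SplitParser`, the separator set, and the use of the invariant culture are assumptions made for illustration; they do not come from the linked discussion.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SplitParser
{
    // Assumed separator set; the question only says "whitespace".
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Baseline: Split allocates one string per token plus the array that holds them.
    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        foreach (string token in s.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Tokens that are not valid numbers are skipped in this sketch.
            if (double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out double value))
                result.Add(value);
        }
        return result;
    }
}
```

Every token becomes a short-lived string here, which is the allocation cost the question is about.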

I tried to do it the old C-like way: compute the indices of the beginning and the end of each substring containing a number, and parse it "in place", without creating an additional string. (See http://ideone.com/Op6h0; the relevant part is shown below.)

int startIdx, endIdx = 0;
while(true)
{
    startIdx = endIdx;
    // no find_first_not_of in C#
    while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
    if (startIdx == s.Length) break;
    endIdx = s.IndexOf(' ', startIdx);
    if (endIdx == -1) endIdx = s.Length;
    // how to extract a double here?
}

There is an overload of string.IndexOf that searches only within a given substring, but I failed to find a method for parsing a double from a substring without actually extracting that substring first.

Does anyone have an idea?

Vlad
  • Have you proved this is actually a bottleneck? I don't *know* of any way of doing it off-hand, but I'd certainly want some evidence of it being a problem before micro-optimizing. – Jon Skeet Apr 15 '12 at 11:22
  • @Jon: not really. The question is based on the discussion at the linked question (http://stackoverflow.com/questions/10053449/extract-numbers-from-string). Sorry for that. – Vlad Apr 15 '12 at 11:23
  • Fair enough. I suspect that a hand-written parse routine would be slower than the presumably-optimized-with-lots-of-experience method the BCL team has come up with :) – Jon Skeet Apr 15 '12 at 11:25
  • @Jon: I definitely don't want to reinvent the [square] wheel. I looked for a way to use the BCL `Parse` for "my" code. – Vlad Apr 15 '12 at 11:27
  • "seems to be quite slow: for each of the numbers we need to allocate a string." - nonsense. – H H Apr 15 '12 at 15:27
  • @Henk: at least Jon Skeet was not so definite. – Vlad Apr 15 '12 at 17:05
  • @Henk: Sorry, but I didn't take your comment as an order to provide proof within "a couple of hours". Anyway, if you want to be useful, you can comment on the original question (http://stackoverflow.com/questions/10053449/extract-numbers-from-string). – Vlad Apr 15 '12 at 18:06
  • @Henk: thanks a lot for your advice, but I would refrain from further discussion, as it seems to be moving from coding questions to personal ones. – Vlad Apr 15 '12 at 18:13
  • @HenkHolterman you are probably right that this is an irrelevant premature optimization in many use cases. In our case, where we cannot easily pre-process large amounts of data into a more sensible format and need to load it on limited platforms, we see significant overhead from GCs caused directly by the allocations in string.Split. The issue behind the question is very relevant to us and, I believe, is one of the reasons Span was introduced in C# 7.2. – DuneCat Dec 15 '17 at 13:18
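The Span mentioned in the last comment makes the loop from the question workable on newer runtimes. A minimal sketch, assuming .NET Core 2.1 or later (where `double.Parse` has a `ReadOnlySpan<char>` overload), space-only separators as in the question, and the invariant culture; the class name `SpanParser` is hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SpanParser
{
    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        ReadOnlySpan<char> rest = s.AsSpan();
        while (true)
        {
            // Skip leading separators; stop when nothing is left.
            rest = rest.TrimStart(' ');
            if (rest.IsEmpty) break;

            int end = rest.IndexOf(' ');
            if (end == -1) end = rest.Length;

            // Slice is only a view over the original string, so no substring is allocated.
            result.Add(double.Parse(rest.Slice(0, end), NumberStyles.Float, CultureInfo.InvariantCulture));
            rest = rest.Slice(end);
        }
        return result;
    }
}
```

This keeps the BCL parser and only removes the per-token string; it does not apply to the .NET Framework versions available when the question was asked.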

2 Answers

There is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.

Anyway, you can save the allocation by creating, once, a "buffer" string of length 100 consisting only of whitespace. Then, for every number you want to parse, you copy its chars into this buffer string using unsafe code and fill the rest of the buffer with whitespace. For parsing you can use NumberStyles.AllowTrailingWhite, which causes the trailing whitespace to be ignored.

Getting a pointer to a string is actually a fully supported operation:

    string l_pos = new string(' ', 100); //don't write to a shared string!
    unsafe
    {
        fixed (char* l_pSrc = l_pos)
        {
            // do some work
        }
    }

C# has special syntax to bind a string to a char*.
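Put together, the buffer idea might look like the sketch below. `BufferParser`, `Scratch` and `filled` are hypothetical names, the code has to be compiled with unsafe blocks enabled, and it writes into a `System.String`, which the runtime otherwise treats as immutable, so take it as an illustration of the trick rather than a recommended pattern.

```csharp
using System;
using System.Globalization;

static class BufferParser
{
    // Private scratch string, never handed to other code (assumption: 100 chars is enough).
    private static readonly string Scratch = new string(' ', 100);
    // Number of leading chars in Scratch that the previous token occupied.
    private static int filled = 0;

    // Parses s[start .. start + length) without allocating a substring.
    // Not thread-safe: all callers share the one scratch string.
    public static unsafe double Parse(string s, int start, int length)
    {
        if (start < 0 || length < 0 || start + length > s.Length)
            throw new ArgumentOutOfRangeException(nameof(length));
        if (length > Scratch.Length)
            throw new ArgumentException("Token longer than the buffer.", nameof(length));

        fixed (char* dst = Scratch)
        fixed (char* src = s)
        {
            for (int i = 0; i < length; i++) dst[i] = src[start + i];
            // Blank only what the previous, longer token left behind.
            for (int i = length; i < filled; i++) dst[i] = ' ';
        }
        filled = length;

        // NumberStyles.Float includes AllowTrailingWhite, so the padding is ignored.
        return double.Parse(Scratch, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

Inside the loop from the question this would be called as `BufferParser.Parse(s, startIdx, endIdx - startIdx)`.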

usr
  • Do I understand correctly: you mean modifying a supposedly immutable `System.String` with unsafe code? – Vlad Apr 15 '12 at 12:23
  • Wouldn't parsing all that whitespace make this actually slower than allocating a new string every time? – svick Apr 15 '12 at 12:28
  • @Vlad, yes you can do that. Just don't pass that string around and keep it private. That way you don't violate assumptions other code makes. StringBuilder uses this technique internally. When you ToString a StringBuilder it just hands you its internal buffer. StringBuilder.ToString often is O(1). – usr Apr 15 '12 at 12:37
  • @svick, you only need to fill very few chars with whitespace if you keep track of how many whitespace chars are already there. – usr Apr 15 '12 at 12:38
  • @usr: but the string doesn't give me legitimate access to its underlying buffer, so isn't this going to be a kind of hack possibly incompatible with the upcoming BCL versions? – Vlad Apr 15 '12 at 12:40
  • See System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData which is currently hard-coded to 8. – usr Apr 15 '12 at 12:51
  • @usr: Oh, I see, thanks. But then, isn't it simpler to use a `StringBuilder` instead? – Vlad Apr 15 '12 at 12:58
  • A StringBuilder cannot be used to parse a double from. When you call ToString, the StringBuilder's internal buffer string is reset (if it wasn't, you could retroactively modify the string that was handed to the application). For the same reason StringBuilder is thread-safe. – usr Apr 15 '12 at 13:06

If you want to do it really fast, I would use a state machine.

This could look like:

enum State
{
    Separator, Sign, Mantissa // etc.
}

State CurrentState = State.Separator;
int Prefix = 0, Exponent = 0, Mantissa = 0;
foreach (var ch in InputString)
{
    switch (CurrentState)
    {
        // set the new CurrentState depending on ch and CurrentState
        case State.Separator:
            GotNewDouble(Prefix, Exponent, Mantissa);
            break;
        // ... remaining states go here
    }
}
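A fuller version of this outline could look like the sketch below. The state names, the `StateMachineParser` class and the `Pow10` helper are assumptions made for illustration; the sketch handles only plain decimal notation with an optional exponent, and it is not guaranteed to round correctly for numbers with many significant digits or large exponents, which is one reason the BCL parser is usually the safer choice.

```csharp
using System;
using System.Collections.Generic;

static class StateMachineParser
{
    enum State { Separator, Sign, Integer, Fraction, ExpStart, ExpSign, Exponent }

    // Repeated multiplication is exact for small n (up to 10^22).
    static double Pow10(int n) { double p = 1; while (n-- > 0) p *= 10; return p; }

    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        var state = State.Separator;
        double mantissa = 0;
        int digits = 0, fracDigits = 0, exp = 0;
        bool neg = false, expNeg = false;

        // One extra iteration with a virtual trailing space flushes the last number.
        for (int i = 0; i <= s.Length; i++)
        {
            char ch = i < s.Length ? s[i] : ' ';
            if (char.IsWhiteSpace(ch))
            {
                bool accepting = state == State.Integer || state == State.Fraction || state == State.Exponent;
                if (accepting && digits > 0)
                {
                    int e10 = (expNeg ? -exp : exp) - fracDigits;
                    double v = e10 >= 0 ? mantissa * Pow10(e10) : mantissa / Pow10(-e10);
                    result.Add(neg ? -v : v);
                }
                else if (state != State.Separator)
                    throw new FormatException("Incomplete number before index " + i);
                // Reset everything for the next number.
                state = State.Separator;
                mantissa = 0; digits = fracDigits = exp = 0; neg = expNeg = false;
            }
            else if (ch >= '0' && ch <= '9')
            {
                int d = ch - '0';
                switch (state)
                {
                    case State.Separator:
                    case State.Sign:
                    case State.Integer:
                        mantissa = mantissa * 10 + d; digits++; state = State.Integer; break;
                    case State.Fraction:
                        mantissa = mantissa * 10 + d; digits++; fracDigits++; break;
                    default: // ExpStart, ExpSign, Exponent
                        exp = Math.Min(exp * 10 + d, 9999); state = State.Exponent; break;
                }
            }
            else if ((ch == '+' || ch == '-') && state == State.Separator) { neg = ch == '-'; state = State.Sign; }
            else if ((ch == '+' || ch == '-') && state == State.ExpStart) { expNeg = ch == '-'; state = State.ExpSign; }
            else if (ch == '.' && (state == State.Separator || state == State.Sign || state == State.Integer))
                state = State.Fraction;
            else if ((ch == 'e' || ch == 'E') && digits > 0 && (state == State.Integer || state == State.Fraction))
                state = State.ExpStart;
            else
                throw new FormatException("Unexpected '" + ch + "' at index " + i);
        }
        return result;
    }
}
```

The whole input is scanned once and no intermediate strings are created, at the price of owning the number-format rules yourself.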
user287107
  • Yes, if you are using TryParse, you need a new string instance every time. Then you get the same behaviour as with var values = string.Split(' ').Select(s => double.Parse(s)).ToArray(); – user287107 Apr 15 '12 at 13:22
  • well, manual parsing tends to be slow and buggy, I'd like to avoid reinventing the wheel if possible. – Vlad Apr 15 '12 at 17:06