
I wonder why C# does not have a version of long.Parse that accepts an offset and length into the string. In effect, I am forced to call string.Substring first.

This is unlike C's strtol, where one does not need to extract a substring first.

If I need to parse millions of rows, I have a feeling there will be overhead creating all those small strings that immediately become garbage.

Is there a way to parse a string into numbers efficiently without creating temporary short-lived garbage strings on the heap? (Essentially doing it the C way.)

mark
    Rather than "having a feeling" have you benchmarked your code and compared the performance with your well-defined performance requirements? Maybe this *is* a problem - but it's much better to check that first than *assume* it's a problem and write more complicated code than you need to work around something that isn't actually an issue. – Jon Skeet Jul 28 '17 at 19:21
  • (If it *does* prove to be a problem, I have code at https://github.com/nodatime/nodatime/blob/master/src/NodaTime/Text/ValueCursor.cs#L114 that you could amend to your needs...) – Jon Skeet Jul 28 '17 at 19:23
  • 3
    It doesn't have that because no one ever bothered to write it. If you want it, write it! If you think it will help other people, submit a pull request. – Eric Lippert Jul 28 '17 at 19:28
  • Ah, my favourite @EricLippert reply. :) – Chris Jul 28 '17 at 19:30
  • @JonSkeet I guess you are right. I am dumping the parsed data into the database anyway, so the overhead associated with that is likely to dwarf anything I do with respect to parsing itself. – mark Jul 28 '17 at 21:12

1 Answer


Unless I'm reading this wrong, strtol doesn't take an offset into the string. It takes a memory address, which the caller can set to any position within a character buffer (or outside the buffer, if they aren't paying attention).

This presents a couple of issues:

  1. Computing the offset requires an understanding of how the string is encoded. I believe C# uses UTF-16 for in-memory strings, currently anyway. If that were ever to change, your offsets would be off, possibly with disastrous results.

  2. A computed address could easily go stale for managed objects, since they are not pinned in memory -- the garbage collector can move them around at any time. You'd have to pin the string using something like GCHandle.Alloc, and when you're done you'd better unpin it, or you could have serious problems!

  3. If you get the address wrong, e.g. outside your buffer, your program is likely going to blow up.

I think C programmers are more accustomed to managing memory-mapped structures themselves, and have no issue computing offsets and addresses and monkeying around with them as you would in assembly. In a managed language like C#, those sorts of things require more work and aren't typically done -- about the only time we pin things in memory is when we have to pass objects off to unmanaged code, and doing so incurs overhead. I wouldn't advise it if your overall goal is to improve performance.

But if you are hell-bent on getting down to the bare metal on this, you could try this solution, where one clever C# programmer reads the string as an array of ASCII-encoded bytes and computes the numbers from those. With his solution you can specify start and length to your heart's content. You'd have to write something different if your strings contain non-ASCII text. I would go this route rather than trying to hack the string object's memory mapping.

John Wu