4

In C++/CLI, What's the most efficient way to convert an array of strings to native char**?

I am doing this:

array<String^>^ tokenArray = gcnew array<String^> {"TokenONE", "TokenTWO"};
int numTokens = tokenArray->Length;
char** ptr = new char* [numTokens];
for(int i = 0; i < numTokens; i++)
    {
        // See: http://stackoverflow.com/questions/6596242/
        array<Byte>^ encodedBytes = Text::Encoding::UTF8->GetBytes(tokenArray[i]);
        pin_ptr<Byte> pinnedBytes = &encodedBytes[0];
        ptr[i] = reinterpret_cast<char*>(pinnedBytes);
    }
int myResult = someNativeFunction(ptr, numTokens);
delete ptr;
// ...

What, if anything should be improved? Is this ok from a memory management point of view? I can change the parameters of someNativeFunction if need be.

Thank you.

OG Dude
  • 936
  • 2
  • 12
  • 22
  • 3
    "What could be improved" - probably not to use manual memory management and pointers at all and instead use `std::string`s on the native side. – Kerrek SB Aug 25 '11 at 13:17
  • 5
    One major problem is that your `pin_ptr`s go out of scope before you use them. – Dark Falcon Aug 25 '11 at 13:21
  • I'm a bit confused by why you need tokenArray at all. Since you aren't using it, why not just make ptr directly point to C-style strings? Even if you are using it, since it doesn't seem to be dynamic, and is small, why wouldn't you create ptr separately anyway? – Ed Bayiates Aug 25 '11 at 18:12
  • To reiterate what DarkFalcon said, you're passing `someNativeFunction` pointers to uninitialized memory, which will result in memory corruption. – ildjarn Aug 25 '11 at 18:52
  • @Kerrek: And what if you're calling a native 3rd-party library that only takes `char**` (and for the sake of argument, let's say it *does* modify the strings)? – Adam Rosenfield Aug 25 '11 at 20:05
  • @Adam: Then you write a cleanly defined interface to that legacy function :-) – Kerrek SB Aug 25 '11 at 20:08
  • @Kerrek : I believe *this* is intended to be that cleanly-defined interface. – ildjarn Aug 25 '11 at 20:30
  • @Ildjarn: Hm, I would probably package away interfaces between managed/unmanaged and C++/legacy into separate blocks, but sure, might as well do it all in one... a matter of taste and reusability opportunities. – Kerrek SB Aug 25 '11 at 20:33
  • @Kerrek : However, on the efficiency side, marshaling all the data twice instead of once is a bit silly. :-] – ildjarn Aug 25 '11 at 20:34
  • @ildjarn: Well, it depends. If the efficiency is worth the pile-of-code, then sure... if factorisation helps you clean up your codebase, then that's a trade-off worth considering. – Kerrek SB Aug 25 '11 at 20:36
  • @ildjarn The code is part of a .NET C++/CLI wrapper around native C++. `tokenArray` is a method argument which I pass by reference from .NET - I just included a dummy definition in my post for brevity. Does this change anything about `pin_ptr`s going out of scope? Thank you. – OG Dude Aug 26 '11 at 07:39
  • 1
    @OG : No, that doesn't change anything -- your `pin_ptr`s go out of scope at each iteration of the `for` loop, unpinning their contents, so by the time you call `someNativeFunction` every element of `ptr` is pointing at random data. – ildjarn Aug 26 '11 at 17:30
  • @ildjarn: Ok, I get it. Is there a way to not go out of scope? I tried creating a managed array with `array>^` to hold the pointer I pin at each iteration but that is not allowed. – OG Dude Aug 29 '11 at 11:15

1 Answers1

4

Apart from the problem with pinned pointers going out of scope before being passed to someNativeFunction(), the code can be simplified for better clarity especially if you're using MSVC2008 or newer. See this page for information on how to convert a single string (extending to an array should be trivial).

Edited:

If you need ANSI strings const char* then making a copy is inevitable since .NET Strings are Unicode (UTF-16). On MSVC2008 and newer, your code may look as follows:

#include <msclr/marshal.h>
using namespace msclr::interop;

marshal_context context;
array<String^>^ tokenArray = gcnew array<String^> {"TokenONE", "TokenTWO"};
char** tokensAsAnsi = new char* [tokenArray->Length];

for(int i = 0; i < tokenArray->Length; i++)
{
    tokensAsAnsi[i] = context.marshal_as<const char*>(tokenArray[i]);
}
int myResult = someNativeFunction(ptr, tokensAsAnsi);

// The marshalled results are freed when context goes out of scope
delete[] tokensAsAnsi;    // Please note you must use delete[] here!

This does similar job to your code sample but without the need of pointer pinning and reinterpret_cast-ing.

If you are willing to deal with wide string const wchar_t* in someNativeFunction(), you can use the (pinned) internal data directly, However, you'll have to ensure the pointers remain pinned until someNativeFunction() returns which, as pointed out in the comments, may negatively influence the GC performance.

If you're about to marshall many strings and performance is of utmost concern, you could split the marshalling across several threads before passing everything to someNativeFunction(). Before doing that, I'd reccommend profiling your application to see if the conversion really is a bottleneck or whether it's better to focus efforts elsewhere.

Edited #2:

To get the native string in UTF-8 encoding, you can do with a modified version of your code:

array<String^>^ tokenArray = gcnew array<String^> {"TokenONE", "TokenTWO"};
char** tokensAsUtf8 = new char* [tokenArray->Length];

for(int i = 0; i < tokenArray->Length; i++)
{
    array<Byte>^ encodedBytes = Text::Encoding::UTF8->GetBytes(tokenArray[i]);

    // Probably just using [0] is fine here
    pin_ptr<Byte> pinnedBytes = &encodedBytes[encodedBytes->GetLowerBound(0)];

    tokensAsUtf8[i] = new char[encodedBytes->Length + 1]; 
    memcpy(
        tokensAsUtf8[i], 
        reinterpret_cast<char*>(pinnedBytes),
        encodedBytes->Length
        );

    // NULL-terminate the native string
    tokensAsUtf8[i][encodedBytes->Length] = '\0'; 

}
int myResult = someNativeFunction(ptr, tokensAsAnsi);

for(int i = 0; i < tokenArray->Length; i++) delete[] tokensAsUtf8[i];
delete[] tokensAsUtf8;    

If you're concerned about speed, you could pre-allocate a large buffer for the native strings (if you know there will only be a limited amount) or use pool storage.

Edited #3:(OG Dude) Just fixed some minor typos.

OG Dude
  • 936
  • 2
  • 12
  • 22
Martin Gunia
  • 1,091
  • 8
  • 12
  • 1
    `gcnew marshal_context` is cringe-inducing -- use stack semantics and no `delete` is necessary. – ildjarn Aug 25 '11 at 20:31
  • @ildjarn: That's true. I just copied it over from the support page. – Martin Gunia Aug 26 '11 at 07:41
  • @Martin Gunia, Thank you. I am aware of that KB article. If I understand correctly, "marshaling" implies creating a copy of the data, which is more expensive than pinning a pointer. – OG Dude Aug 26 '11 at 07:55
  • @OG Dude: Yes, usually this is the case. If you want to avoid copy at any cost, then pin the internal string pointer returned from `PtrToStringChars`. However, remember that .NET strings are Unicode (i.e. each Char is coded as UTF-16). For that reason, marshalling (and making an explicit copy) to `const char*` is much safer and necessary if you want to use `const char*` and not `const wchar_t*`. – Martin Gunia Aug 26 '11 at 13:32
  • 1
    @OG : Whether or not marshaling is cheaper or more expensive than pinning data depends on the overall allocation rate of your application; pinning managed objects can severely hinder the GC if there's already a lot of allocation churn, so in some scenarios marshaling is in fact cheaper. – ildjarn Aug 26 '11 at 17:32
  • @Martin Gunia: On the subject of .NET UTF16 -> const char*. Assuming my wrapped method expects "valid UTF8 bytes" as input. If you check the code in my post I do a `UTF8->GetBytes(a_net_string)` then call `reinterpret_cast` on the first (pinned) element of the resulting array which yields my input argument. Should be ok that way, no? @ildjarn: - Ok, I've never tested marshaling vs. pinning performance. Thank you for both of your replies. – OG Dude Aug 29 '11 at 10:04
  • 1
    Yes, if you want the `const char*` string to be UTF-8 encoded then your way is fine. You just have to either copy the chars to another buffer or keep the data pinned (e.g. by putting the pinned pointers to a container) until you're finished with them. – Martin Gunia Aug 29 '11 at 10:43
  • @Martin Gunia: Alright. As for the pointer scope issue, I tried storing the pinned pointers in a managed array (see comment section on original post) but that's not allowed. What kind of container would do? – OG Dude Aug 29 '11 at 11:29
  • Sorry for misinformation, you cannot put pin_ptr into a container, it has to be a stack variable. So I would immediately `memcpy()` the pinned pointer (see [this question](http://stackoverflow.com/q/6944288/898146)). That may anyway be better even if you have many strings to avoid memory fragmentation. – Martin Gunia Aug 29 '11 at 12:11
  • @Martin Gunia - so you're suggesting to copy the data at each pinned pointer to a native buffer? Just asking since in the [question](http://stackoverflow.com/q/6944288/898146) you linked `memcpy` is used to copy a native array to a managed array (whereas I need the other direction). – OG Dude Aug 31 '11 at 10:34
  • 1
    Yes, that's what I'd do, pinning the pointers only until memcpy-ed to own buffer. The linked atricle deals with the other direction but you can just swap the memcpy arguments, they're both native buffers now anyway. I've added code to the answer. – Martin Gunia Aug 31 '11 at 13:09
  • @Martin Gunia, Thank you - I will try that. I appreciate the time you spent on this. In the meanwhile I implemented a solution based on an array of `IntPtr` which holds the results of a `Marshal::Copy` call for each "token". I then convert these to char* pointers. I got the impression that `Marshal::Copy` does not do what it's supposed to. The copied `char*` contains a non-deterministic number of extra characters. I created a very baseline copy-paste example for just one string [here](http://codepad.org/UpJAJhWO). What is up with that? – OG Dude Aug 31 '11 at 18:30
  • Actually, `UTF8::GetBytes()` does not encode the terminating character (\0) so it needs to be added manually to the native string. That is why `strlen()` and `printf` overflows until it finds \0 byte. I'll have to fix my code above :) – Martin Gunia Sep 01 '11 at 05:54
  • Ah, that was it. Still, I prefer your edited solution to the one I outlined in my previous comment. Thanks again. – OG Dude Sep 01 '11 at 14:14