
Several months ago, I wrote a Java API that uses JNI to wrap a C API. The C API uses char strings, and I used GetStringUTFChars to create the C strings from the Java Strings.

I neglected to think through the problems that might arise with non-ASCII characters.

Since then, the creator of the C API has added wide-character equivalents of each of his C functions, which take or return wchar_t strings. I would like to update my Java API to use these wide-character functions and overcome the issue I have with non-ASCII characters.

Having studied the JNI documentation, I am a little confused about the relative merits of the GetStringChars and GetStringRegion functions.

I am aware that the size of a wchar_t character varies between Windows and Linux and am not sure of the most efficient way to create the C strings (and convert them back to Java strings afterwards).

This is the code I have at the moment, which I think creates a string with two bytes per character:

jsize len;
jchar *Src;

/* Length in UTF-16 code units (jchars), not bytes */
len = (*env)->GetStringLength(env, jSrc);
printf("Length of jSrc is %d\n", (int)len);

/* Copy the string's UTF-16 contents into a C buffer, with room for a terminator */
Src = (jchar *)malloc((len + 1) * sizeof(jchar));
(*env)->GetStringRegion(env, jSrc, 0, len, Src);
Src[len] = '\0';

However, this will need modifying when the size of wchar_t differs from that of jchar.
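
For the reverse direction, I assume something like this would work, again only while sizeof(wchar_t) matches sizeof(jchar) (wSrc here stands for a null-terminated wchar_t string returned by the C API):

jsize wlen;
jstring jDst;

/* Only valid where wchar_t and jchar are the same size (e.g. Windows) */
wlen = (jsize)wcslen(wSrc);
jDst = (*env)->NewString(env, (const jchar *)wSrc, wlen);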

RobWills

1 Answer


Isn't the C API creator willing to take a step back and reimplement with UTF-8? :) Your work would essentially disappear, needing only GetStringUTFChars/NewStringUTF.

jchar is typedef'd to unsigned short and is equivalent to the JVM char, which is UTF-16. So on Windows, where wchar_t is 2-byte UTF-16 too, you can do away with the code you presented. Just copy the raw bytes around and allocate accordingly. Don't forget to free the buffer after you're finished with the C API call. Complement with NewString for the conversion back to jstring.
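
A minimal sketch of that Windows-only path might look like this (assuming sizeof(wchar_t) == 2; c_api_call and wResult are hypothetical stand-ins for a wide C API function and a wide string it returns):

jsize len = (*env)->GetStringLength(env, jSrc);
wchar_t *wSrc = malloc((len + 1) * sizeof(wchar_t));

/* jchar and wchar_t are both 2-byte UTF-16 here, so a raw copy suffices */
(*env)->GetStringRegion(env, jSrc, 0, len, (jchar *)wSrc);
wSrc[len] = L'\0';

c_api_call(wSrc);  /* hypothetical wide C API call */
free(wSrc);        /* free once the C API is done with it */

/* back to jstring: reinterpret the returned wchar_t string as jchar */
jstring jDst = (*env)->NewString(env, (const jchar *)wResult, (jsize)wcslen(wResult));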

The only other wchar_t size I am aware of is 4 bytes (most prominently on Linux), being UTF-32. And here comes the problem: UTF-32 is not just UTF-16 somehow padded to 4 bytes. Allocating double the amount of memory is just the beginning. There is a substantial conversion to do, like this one, which seems to be sufficiently free.
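
For illustration, here is roughly what that conversion involves: supplementary code points are stored in UTF-16 as surrogate pairs, which must be decoded into single UTF-32 units. A bare sketch only, reusing Src and len from your snippet, with no validation of malformed input (the linked converter handles that properly):

/* Decode the UTF-16 buffer Src (length len) into 4-byte wchar_t */
wchar_t *wDst = malloc((len + 1) * sizeof(wchar_t)); /* worst case: one wchar_t per code point */
jsize i = 0, j = 0;
while (i < len) {
    jchar hi = Src[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < len) {
        /* high surrogate: pair it with the following low surrogate */
        jchar lo = Src[i++];
        wDst[j++] = 0x10000 + (((wchar_t)(hi - 0xD800) << 10) | (wchar_t)(lo - 0xDC00));
    } else {
        wDst[j++] = hi;
    }
}
wDst[j] = L'\0';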

But if you are not after performance that much and are willing to give up the plain memory copying on Windows, I suggest going from jstring to UTF-8 (which is what JNI provides natively, with documented functionality) and then from UTF-8 to UTF-16 or UTF-32, depending on sizeof(wchar_t). There won't be any assumptions about what byte order and UTF encoding each platform gives. You seem to care about that; I see that you are checking sizeof(jchar), which is 2 for most of the visible universe :)
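
That route could look like this; utf8_to_wchar is a hypothetical converter you would supply yourself, choosing UTF-16 or UTF-32 output based on sizeof(wchar_t):

const char *utf8 = (*env)->GetStringUTFChars(env, jSrc, NULL);
if (utf8 != NULL) {
    /* utf8_to_wchar is hypothetical: your UTF-8 to UTF-16/UTF-32 step;
       byte order never enters the picture */
    wchar_t *wSrc = utf8_to_wchar(utf8);
    (*env)->ReleaseStringUTFChars(env, jSrc, utf8);

    c_api_call(wSrc);  /* hypothetical wide C API call */
    free(wSrc);
}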

Pavel Zdenek
  • Thanks @pavel for your answer - certainly food for thought. When you say, "you can do away with the code you presented", I thought my code was just copying the raw bytes around (apart from the addition of the null terminator)? How else could I do that? Thanks also for the Unicode conversion link and info, because that was my next problem! (I tried to vote up your answer but I'm not worthy) – RobWills Jan 15 '13 at 13:50
  • Within the particular combination of Windows on Intel, you really can do away with copying raw bytes, because the storage is UTF-16LE in both cases. If the JVM runs on a BE architecture (rare but possible), then you would have to add some byte swapping. The "else" solution is to not assume anything and to use JNI to convert through UTF-8, which is byte order agnostic. – Pavel Zdenek Jan 15 '13 at 14:14
  • Come back when you're badged worthy, no problem :) – Pavel Zdenek Jan 15 '13 at 14:19
  • Well, I've had further discussion with the database vendor who wrote the C API and we've agreed to revert to the UTF-8 functions, so thanks again for your advice. – RobWills Jan 15 '13 at 16:50
  • Wow, you must have a serious relationship with the database vendor :) Normally, the poor JNI programmer works around C API deficiencies, not vice versa. Glad to see this resolution. – Pavel Zdenek Jan 16 '13 at 09:13
  • It is worth noting that GetStringUTFChars doesn't return valid UTF-8, and may cause problems if text outside the BMP (code point >= 0x10000) ever enters your system (hi emoji!) and another part of the system is strict about UTF-8 validity. To clarify, those code points are encoded as surrogate pairs, which are forbidden code points in UTF-8. – scoopr Aug 15 '13 at 05:40