I am using the following piece of code to convert a string to UTF-8 on Linux. Please note that in my build sizeof(wchar_t) == 2 because of a compiler flag.

void convert(const wchar_t* data, size_t len)
{
    // The cast below is only meaningful if both types are 16 bits wide.
    ASSERT(sizeof(wchar_t) == sizeof(jchar));

    JNIEnv* env = GetEnv();
    // JString is presumably a project wrapper around jstring (definition not shown).
    JString jStr = env->NewString((const jchar *)data, len);

    int cbLen = jStr.GetStringUTFLength();

    char* pUTF8Str = new (std::nothrow) char[cbLen + 1];
    //IFALLOCFAILED_EXIT(pUTF8Str);

    strncpy_s(pUTF8Str, cbLen + 1, jStr.GetUTFString(), cbLen);
    // release memory...
}

The code crashes in NewString for a certain set of Unicode characters. Am I doing something wrong?

user1989504
  • The character encoding UTF does not exist. What do you mean? UTF-16, UTF-32, or something else? BTW, Java uses UTF-16 for its strings (`String`). Also note that your code does not convert from one encoding to another. You just pass the pointer. – Ludovic Kuty Dec 14 '15 at 13:15
  • The issue is that my code crashes inside NewString() in JNI.h for some Unicode characters, while it works fine in the majority of cases. I added some more code to give some context. – user1989504 Dec 14 '15 at 13:38
  • I think that `NewString` expects Java characters, that is, a UTF-16 encoded string with surrogate pairs to handle characters outside the Basic Multilingual Plane (BMP). Please check the documentation of the [`Character`](http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html) class in Java. There is also a function named `NewStringUTF` in JNI. – Ludovic Kuty Dec 14 '15 at 13:51
  • Also, could you tell us for which characters the function crashes? – Ludovic Kuty Dec 14 '15 at 13:52
  • I cannot get the input characters, as this is something we captured by analyzing third-party experience of our product. This is the major bottleneck :( I am trying to find examples for which this would crash. I know about NewStringUTF, but that takes char* as input, which will not serve my purpose. – user1989504 Dec 14 '15 at 14:03
  • A Java character is 2 bytes, and wchar_t is also 2 bytes in my case. That is why I type-cast the input to NewString. I think this should suffice. – user1989504 Dec 14 '15 at 14:05
  • I do not think so, because of surrogate pairs in UTF-16. Sometimes one Unicode character is encoded as two 16-bit chars, in a pair called a surrogate pair. Thus we have to interpret the data as a single character although there are two 16-bit chars. I don't know if the crash comes from there, but it could. Most importantly, `data` should probably be valid UTF-16. I say "should" because I am not familiar with `NewString`. So check it out... – Ludovic Kuty Dec 14 '15 at 14:09
  • Note that you didn't answer my question regarding the word "UTF" that you used. What is it? – Ludovic Kuty Dec 14 '15 at 14:10
  • Oh, my bad. That's just the terminology I am using for naming my APIs. Please ignore. I have corrected the question as well – user1989504 Dec 14 '15 at 14:22
  • Are the characters in `data` encoded in UTF-16? – Ludovic Kuty Dec 14 '15 at 14:27
  • Please check [this question](http://stackoverflow.com/q/16939349/452614) and the answer by Tom Blodget – Ludovic Kuty Dec 14 '15 at 14:34
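
The comments above ask whether `data` is well-formed UTF-16, that is, whether every high surrogate (0xD800 to 0xDBFF) is immediately followed by a low surrogate (0xDC00 to 0xDFFF). A minimal sketch of how that could be checked before calling `NewString`, together with a JNI-free conversion to standard UTF-8, is shown below. The helper names `first_unpaired_surrogate` and `utf16_to_utf8` are hypothetical and not part of JNI. The sketch uses `char16_t` and assumes the 2-byte `wchar_t` buffer from the question holds 16-bit code units in native byte order. Whether malformed input actually explains the crash is not established in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of JNI): returns the index of the first
// unpaired surrogate in a UTF-16 buffer, or len if the buffer is well-formed.
std::size_t first_unpaired_surrogate(const char16_t* data, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        const char16_t u = data[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < len && data[i + 1] >= 0xDC00 && data[i + 1] <= 0xDFFF) {
                ++i;  // skip the low half of a valid pair
            } else {
                return i;
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return i;  // low surrogate with no preceding high surrogate
        }
    }
    return len;
}

// Hypothetical helper: converts UTF-16 to standard UTF-8 without JNI.
// Design choice of this sketch: unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const char16_t* data, std::size_t len)
{
    std::string out;
    out.reserve(len * 3);  // a 16-bit unit expands to at most 3 UTF-8 bytes
    for (std::size_t i = 0; i < len; ++i) {
        std::uint32_t cp = data[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            data[i + 1] >= 0xDC00 && data[i + 1] <= 0xDFFF) {
            // Combine a valid surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (data[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // replacement character for an unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Logging the index returned by `first_unpaired_surrogate` for the failing inputs would be one way to test the surrogate-pair hypothesis in the field. Note also that JNI's `GetStringUTFChars` returns modified UTF-8, which encodes supplementary characters and U+0000 differently from standard UTF-8, so a direct conversion like the one sketched here may be closer to what the question wants.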

0 Answers