
I'm porting my JNA-based library to "pure" Java using the Foreign Function and Memory API ([JEP 424][1]) in JDK 19.

One frequent use case my library handles is reading (null-terminated) Strings from native memory. For most *nix applications, these are "C Strings" and the MemorySegment.getUtf8String() method is sufficient to the task.
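Conceptually, getUtf8String() does a strlen()-style scan for the zero byte and then decodes the preceding bytes as UTF-8. A minimal pure-Java sketch of that behavior over a plain byte array (the helper name and array-based "memory" are illustrative, not the FFM API):

```java
import java.nio.charset.StandardCharsets;

public class CStringDemo {
    // Conceptual sketch of what MemorySegment.getUtf8String(offset) does:
    // scan for the zero terminator byte, then decode the bytes before it as UTF-8.
    static String getUtf8String(byte[] memory, int offset) {
        int end = offset;
        while (end < memory.length && memory[end] != 0)
            end++;
        return new String(memory, offset, end - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] mem = {'h', 'i', 0, 'x'};  // "hi\0" followed by stale data
        System.out.println(getUtf8String(mem, 0)); // prints "hi"
    }
}
```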

Native Windows strings, however, are stored in UTF-16 (LE). Referenced as arrays of TCHAR or as "Wide Strings", they are treated similarly to "C Strings" except that each character consumes 2 bytes.

JNA provides a Native.getWideString() method for this purpose which invokes native code to efficiently iterate over the appropriate character set.

I don't see a UTF-16 equivalent to the getUtf8String() (and corresponding set...()) optimized for these Windows-based applications.

I can work around the problem with a few approaches:

  • If I'm reading from a fixed size buffer, I can create a new String(bytes, StandardCharsets.UTF_16LE) and:
    • If I know the memory was cleared before being filled, use trim()
    • Otherwise split() on the null delimiter and extract the first element
  • If I'm just reading from a pointer offset with no knowledge of the total size (or a very large total size I don't want to instantiate into a byte[]) I can iterate character-by-character looking for the null.
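The first workaround (fixed-size buffer) can be sketched with plain Java, decoding the whole buffer and truncating at the first null character (the helper name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class WideStringWorkaround {
    // Fixed-size-buffer workaround: decode everything as UTF-16LE,
    // then cut the result at the first null character, if any.
    static String fromFixedBuffer(byte[] buffer) {
        String decoded = new String(buffer, StandardCharsets.UTF_16LE);
        int nul = decoded.indexOf('\0');
        return nul >= 0 ? decoded.substring(0, nul) : decoded;
    }

    public static void main(String[] args) {
        // "Hi" in UTF-16LE, a null terminator, then stale data after it
        byte[] buf = {'H', 0, 'i', 0, 0, 0, 'X', 0};
        System.out.println(fromFixedBuffer(buf)); // prints "Hi"
    }
}
```

Using indexOf('\0') instead of split() avoids allocating an array of substrings just to keep the first one.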

While certainly I wouldn't expect the JDK to provide native implementations for every character set, I would think that Windows represents a significant enough usage share to support its primary native encoding alongside the UTF-8 convenience methods. Is there a method to do this that I haven't discovered yet? Or are there any better alternatives than the new String() or character-based iteration approaches I've described?

Daniel Widdis
  • [The *CharsetDecoder* class should be used when more control over the decoding process is required](https://download.java.net/java/early_access/loom/docs/api/jdk.incubator.foreign/jdk/incubator/foreign/MemorySegment.html#getUtf8String(long)). – JosefZ Oct 30 '22 at 09:57
  • Related question: [CLinker.toCString replacement in Java 18](https://stackoverflow.com/questions/71729585/clinker-tocstring-replacement-in-java-18/) – Johannes Kuhn Jan 04 '23 at 17:11

2 Answers


A charset decoder provides a way to convert a null-terminated wide (UTF-16LE) MemorySegment to a String on Windows using the Foreign Memory API. This may not be an improvement over your workaround suggestions, as it still involves scanning the resulting character buffer for the null position.

public static String toJavaString(MemorySegment wide) {
    return toJavaString(wide, StandardCharsets.UTF_16LE);
}
public static String toJavaString(MemorySegment segment, Charset charset) {
    // JDK Panama only handles UTF-8; its strlen()-style scan for a 0 byte is
    // valid because every byte of a multi-byte UTF-8 sequence has its high bit
    // set, so a 0 byte can only be the terminator
    if (StandardCharsets.UTF_8 == charset)
        return segment.getUtf8String(0);

    // if (StandardCharsets.UTF_16LE == charset) {
    //     return Holger answer
    // }

    // This conversion is convoluted: MemorySegment->ByteBuffer->CharBuffer->String
    CharBuffer cb = charset.decode(segment.asByteBuffer());

    // cb.array() isn't valid unless cb.hasArray() is true so use cb.get() to
    // find a null terminator character, ignoring it and the remaining characters
    final int max = cb.limit();
    int len = 0;
    while(len < max && cb.get(len) != '\0')
        len++;

    return cb.limit(len).toString();
}
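The decode-then-scan step at the heart of the method can be exercised with a plain ByteBuffer, which is effectively what charset.decode(segment.asByteBuffer()) operates on (the demo class and helper name are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeScanDemo {
    // Decode the whole buffer with the charset, then scan the resulting
    // CharBuffer for the null terminator and truncate via limit().
    static String decodeToNull(ByteBuffer bytes, Charset charset) {
        CharBuffer cb = charset.decode(bytes);
        final int max = cb.limit();
        int len = 0;
        while (len < max && cb.get(len) != '\0')
            len++;
        return cb.limit(len).toString();
    }

    public static void main(String[] args) {
        byte[] raw = {'O', 0, 'K', 0, 0, 0, '!', 0};  // "OK\0!" in UTF-16LE
        System.out.println(decodeToNull(ByteBuffer.wrap(raw), StandardCharsets.UTF_16LE)); // prints "OK"
    }
}
```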

Going the other way, String -> null-terminated Windows wide MemorySegment:

public static MemorySegment toCString(SegmentAllocator allocator, String s, Charset charset) {
    // "==" is OK here as StandardCharsets.UTF_8 == Charset.forName("UTF8")
    if (StandardCharsets.UTF_8 == charset)
        return allocator.allocateUtf8String(s);

    // else if (StandardCharsets.UTF_16LE == charset) {
    //     return Holger answer
    // }

    // For multi-byte charsets it is safest to append '\0' to the String so that
    // getBytes() encodes the terminator in the charset's own width (1, 2 or 4 bytes)
    return allocator.allocateArray(JAVA_BYTE, (s+"\0").getBytes(charset));
}

/** Convert Java String to Windows Wide String format */
public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
    return toCString(allocator, s, StandardCharsets.UTF_16LE);
}
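The reason for appending '\0' before encoding is that getBytes() then emits the terminator in whatever width the charset uses, so the same code works for 1-byte and 2-byte (and wider) encodings:

```java
import java.nio.charset.StandardCharsets;

public class TerminatorWidthDemo {
    public static void main(String[] args) {
        // The appended '\0' is encoded in the charset's own code-unit width,
        // so the null termination is sized correctly for each encoding.
        System.out.println(("Hi" + "\0").getBytes(StandardCharsets.UTF_8).length);    // 3: 2 chars + 1-byte terminator
        System.out.println(("Hi" + "\0").getBytes(StandardCharsets.UTF_16LE).length); // 6: 2 chars + 2-byte terminator
    }
}
```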

Like you, I'd also like to know if there are better approaches than the above.

DuncG
  • Thanks for these great examples. I'd like to include a method based on this code in my AL2.0 licensed library without imposing the attribution requirement of CC BY-SA license on my downstream users. May I have your permission to use your code under less restrictive license requirements? – Daniel Widdis Nov 01 '22 at 06:03
  • @DanielWiddis Yes if you think it helps your work, you may use the example without attribution (and of course you're aware there is no guarantee it works / is suitable etc). Hopefully these conversions will be built into a future JDK. – DuncG Nov 01 '22 at 08:01
  • 1
    I wouldn’t assume that every `CharBuffer` returned by `charset.decode(…)` is guaranteed to support access to the underlying array (if it is array based at all). On the other hand, you don’t need to access the array. You can search for the zero on the `CharBuffer` and set the `limit`, then simply calling `toString()` on the `CharBuffer` will give you the result. And when not relying on an array, the OP’s specific case doesn’t need a conversion at all as the `asCharBuffer()` *view* on the memory segment does a fluent UTF-16 interpretation (see my answer). – Holger Jan 04 '23 at 10:44
  • @Holger Thanks, your answer cuts down on the transformations for wide conversion, and I'll fix mine to avoid `array()` (not valid unless `hasArray() == true`). – DuncG Jan 04 '23 at 11:53

Since Java’s char is a UTF-16 unit, there’s no need for special “wide string” support in the Foreign API, as the conversion (which may be a mere copying operation in some cases) does already exist:

public static String fromWideString(MemorySegment wide) {
  var cb = wide.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer();
  int limit = 0; // check for zero termination
  for(int end = cb.limit(); limit < end && cb.get(limit) != 0; limit++) {}
  return cb.limit(limit).toString();
}

public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
  MemorySegment ms = allocator.allocateArray(ValueLayout.JAVA_CHAR, s.length() + 1);
  ms.asByteBuffer().order(ByteOrder.nativeOrder()).asCharBuffer().put(s).put('\0');
  return ms;
}

This is not using UTF-16LE specifically, but the current platform’s native order, which is usually the intended thing on a platform with native wide strings. Of course, when running on Windows x86 or x64, this will result in the UTF-16LE encoding.

Note that CharBuffer implements CharSequence, which means that for a lot of use cases you can omit the final toString() step when reading a wide string and process the memory segment without a copying step.
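A pure-Java sketch of that zero-copy idea, using a heap ByteBuffer to stand in for the memory segment's byte view (the class and helper name are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;

public class CharBufferViewDemo {
    // Return the zero-terminated char view without building a String,
    // mirroring the scan in fromWideString() but stopping before toString().
    static CharSequence wideView(ByteBuffer bytes) {
        CharBuffer cb = bytes.order(ByteOrder.nativeOrder()).asCharBuffer();
        int limit = 0;
        while (limit < cb.limit() && cb.get(limit) != 0)
            limit++;
        return cb.limit(limit);  // CharBuffer is itself a CharSequence
    }

    public static void main(String[] args) {
        ByteBuffer bytes = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
        bytes.asCharBuffer().put("abc").put('\0');  // "abc\0" in native order
        CharSequence cs = wideView(bytes);
        // Many APIs accept a CharSequence directly, no copying String needed
        System.out.println(cs.length());  // prints 3
        System.out.println(cs.charAt(0)); // prints a
    }
}
```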

Holger
  • Is the UTF-16 internal guaranteed? I recall something about internally storing ASCII strings in UTF8 since Java 11. – Daniel Widdis Jan 04 '23 at 15:30
  • 2
    What you mean, is an implementation detail of the `String` class; it uses ISO-LATIN-1 where appropriate since JDK 9. However, the `CharBuffer` view on a byte sequence is guaranteed to use UTF-16. That’s why there’s no way around the `put(String)` method copying the data from the string’s internal representation to the buffer’s UTF-16. It may be a plain copying operation (when the string contains non-latin characters) or an inflation from one-byte to two-byte representation. In other words, the operation is equivalent to calling `putChar` on the `ByteBuffer` for every `char`. – Holger Jan 04 '23 at 15:39