
Some analysis of a Java application showed that it spends a lot of time decoding UTF-8 byte arrays into String objects. The UTF-8 bytes come from an LMDB database whose values are Protobuf messages, which is why the application does so much decoding. A second problem this causes is that the decoded Strings take up a large chunk of memory, because every read copies data out of the memory-map into a new String object in the JVM.

I want to refactor this application so it does not allocate a new String every time it reads a message from the database. I want the underlying char array in the String object to simply point to the memory location.

package testreflect;

import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class App {
    public static void main(String[] args) throws Exception {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe UNSAFE = (Unsafe) field.get(null);

        char[] sourceChars = new char[] { 'b', 'a', 'r', 0x2018 };

        // Encoding to a byte array; asBytes would be an LMDB entry
        byte[] asBytes = new byte[sourceChars.length * 2];
        UNSAFE.copyMemory(sourceChars, 
                UNSAFE.arrayBaseOffset(sourceChars.getClass()), 
                asBytes, 
                UNSAFE.arrayBaseOffset(asBytes.getClass()), 
                sourceChars.length*(long)UNSAFE.arrayIndexScale(sourceChars.getClass()));

        // Copying the byte array to the char array works, but is there a way to
        // have the char array simply point to the byte array without copying?
        char[] test = new char[sourceChars.length];
        UNSAFE.copyMemory(asBytes, 
                UNSAFE.arrayBaseOffset(asBytes.getClass()), 
                test, 
                UNSAFE.arrayBaseOffset(test.getClass()), 
                asBytes.length*(long)UNSAFE.arrayIndexScale(asBytes.getClass()));

        // Allocate a String object, but set its underlying char array
        // manually to avoid the extra memory copy. NOTE: this assumes a
        // pre-Java 9 String, where "value" is a char[]; Java 9+ compact
        // strings store a byte[] plus a coder field, so this would break there.
        long stringOffset = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        String stringTest = (String) UNSAFE.allocateInstance(String.class);
        UNSAFE.putObject(stringTest, stringOffset, test);
        System.out.println(stringTest);
    }
}

So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.

However, this does not solve the memory problem. Is there a way to have a char array point to a memory location and avoid a memory allocation altogether? Avoiding the copy would also cut the number of unnecessary allocations the JVM makes for these strings, leaving more room for the OS to cache entries from the LMDB database.

user1428945
  • If performance is critical, why not implement a CharSequence to fit your needs? Strings are just a special type of CharSequence with an emphasis on immutability and encapsulation rather than performance. Your CharSequence implementation could have other priorities in mind, for example accepting a byte[] and performing no copying or extra memory allocation. https://docs.oracle.com/javase/7/docs/api/java/lang/CharSequence.html – Alkis Mavridis Oct 13 '18 at 03:56
  • I might, but it will require refactoring a bunch of code we already have (equals would have to be replaced with compareTo, etc.). – user1428945 Oct 13 '18 at 03:59
  • Some clarification: I'm hesitant to use CharSequence because the compiler is not going to help find potentially code-breaking changes when String instances are replaced with CharSequence. It is, however, a valid option and I will consider it. That said, keeping String around would be ideal. – user1428945 Oct 13 '18 at 04:09

1 Answer


I think you are taking the wrong approach here.

So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.

Erm ... no.

Using a memory copy to go from a byte[] to a char[] is not going to work. Each char in the destination char[] will actually contain 2 bytes from the original. If you then wrap that char[] as a String, you will get a weird kind of mojibake.

What a real UTF-8 to String conversion does is convert the 1 to 4 bytes (code units) that represent each codepoint in UTF-8 into the 1 or 2 16-bit code units that represent the same codepoint in UTF-16. That cannot be done using a plain memory copy.

If you aren't familiar with it, it would be worth reading the Wikipedia article on UTF-8 so that you understand how the text is encoded.
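
To make the mismatch concrete, here is a small standalone sketch (standard library only, separate from the Unsafe code above; the class name Utf8VsMemcpyDemo is invented for illustration) that decodes a UTF-8 byte array properly and then reinterprets the same bytes two-at-a-time as UTF-16 code units, which is all a raw memory copy into a char[] can achieve:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class Utf8VsMemcpyDemo {
    public static void main(String[] args) {
        String original = "bar\u2018";                            // U+2018 takes 3 bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);  // 6 bytes for 4 characters

        // A real decode: variable-length UTF-8 code units -> UTF-16 code units
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original));             // true

        // Raw reinterpretation: each pair of bytes becomes one char,
        // which is what a memory copy from byte[] to char[] amounts to
        char[] raw = new char[utf8.length / 2];
        ByteBuffer.wrap(utf8).order(ByteOrder.nativeOrder()).asCharBuffer().get(raw);
        System.out.println(new String(raw));                      // mojibake, not "bar‘"
    }
}

The byte count and the layout of the two encodings differ, so no fixed pairing of bytes can recover the original characters.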


The solution depends on what you intend to do with the text data.

  • If the data must really be in the form of String (or StringBuilder or char[]) objects, then you really have no choice but to do the full conversion. Try anything else and you are liable to mess up; e.g. garbled text and potential JVM crashes.

  • If you want something that is "string like", you could conceivably implement a custom CharSequence that wraps the bytes in the messages and decodes the UTF-8 on the fly. But doing that efficiently may be a problem, especially implementing the charAt method as an O(1) operation (a sketch follows this list).

  • If you simply want to hold and/or compare the (entire) texts, this could possibly be done by representing them as (or in) byte[] objects. Those operations can be performed on the UTF-8 encoded data directly.

  • If the input text could actually be sent in a character encoding with a fixed 8-bit character size (e.g. ASCII, Latin-1, etc.) or as UTF-16, that simplifies things.
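
For the second and fourth bullet points, a minimal sketch of what such a wrapper might look like is shown below. It assumes the text is stored in a fixed-width 8-bit encoding (Latin-1/ASCII); the class name Latin1Slice and its layout are invented for illustration, not taken from any library. charAt stays O(1) only because every character is exactly one byte, which is precisely what plain UTF-8 does not give you:

import java.nio.charset.StandardCharsets;

// Sketch only: a CharSequence view over a byte[] holding ISO-8859-1 / ASCII text.
// No copying and no up-front decoding; a UTF-8-backed version would need extra
// bookkeeping (e.g. a codepoint index) to keep charAt cheap.
final class Latin1Slice implements CharSequence {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    Latin1Slice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        // Latin-1 maps one byte to one char in U+0000..U+00FF
        return (char) (bytes[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new Latin1Slice(bytes, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Only materialise a String when one is genuinely required
        return new String(bytes, offset, length, StandardCharsets.ISO_8859_1);
    }
}

For the third bullet point, equality checks do not need a wrapper at all: for well-formed UTF-8, two encodings of the same text are byte-for-byte identical, so java.util.Arrays.equals on the raw byte[] values gives the same answer as String.equals on the decoded strings.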

Stephen C
  • Sorry for not making this clear, but I'm not going to encode the strings to UTF-8 when storing them in the database. I'm going to keep the UTF-16 encoding, though I don't know whether I'm going to switch from Protobuf or write a custom serializer. I know there are problems like endianness, but the application does not communicate with other services, so it shouldn't be affected by this (see the sketch after this thread). – user1428945 Oct 13 '18 at 16:47
  • You could use protobuf to send the text as UTF-16. But you need to look at the *entire* application (clients, services, databases, etc.) if you are going to optimize to this degree. – Stephen C Oct 14 '18 at 00:41
  • I have looked at and fixed various bottlenecks and performance-hindering code in the app. Decoding UTF-8 is the next bottleneck I'm tackling. – user1428945 Oct 14 '18 at 02:21
  • What I mean is that you need to look at this holistically, not piecemeal. If you change to transmitting UTF-16, you are most likely going to double the size of the text in the messages, and increase protobuf marshalling / unmarshalling costs on the client and server sides. You will also need to change all of the clients. This is not a simple change. – Stephen C Oct 14 '18 at 02:40
  • There's one client. And I am going to benchmark this. If it turns out decoding UTF-8 is faster than copying some memory, I'll drop this. – user1428945 Oct 14 '18 at 02:56
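
If the values do end up stored as raw UTF-16 code units, one zero-copy option worth including in that benchmark is to view the bytes through a CharBuffer, which is itself a CharSequence. The sketch below fakes the LMDB value with an allocated direct buffer and an invented class name (Utf16ViewDemo); assuming the LMDB binding exposes entries as ByteBuffer views of the mapped memory, the same view could be taken over the map directly:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class Utf16ViewDemo {
    public static void main(String[] args) {
        // Stand-in for an LMDB value: UTF-16LE code units in a direct buffer.
        // With a real binding this would be the buffer handed back for the entry.
        ByteBuffer value = ByteBuffer.allocateDirect(8);
        value.put("bar\u2018".getBytes(StandardCharsets.UTF_16LE));
        value.flip();

        // Zero-copy view: the CharBuffer shares the buffer's memory and
        // implements CharSequence, so no char[] or String is allocated here.
        CharBuffer text = value.order(ByteOrder.LITTLE_ENDIAN).asCharBuffer();

        System.out.println(text.length());   // 4
        System.out.println(text.charAt(3));  // the U+2018 quote
        System.out.println(text);            // copies into a String only when printed
    }
}

The caveats discussed above still apply: the stored text roughly doubles in size versus UTF-8, and any code that insists on an actual String rather than a CharSequence still forces a copy.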