Some analysis of a Java application showed that it spends a lot of time decoding UTF-8 byte arrays into String objects. The UTF-8 bytes come from an LMDB database whose values are Protobuf messages, which is why so much UTF-8 decoding happens. A second problem is memory: every decode copies data out of the memory map into a new String object on the JVM heap, so Strings take up a large chunk of memory.
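To make the cost concrete, here is a minimal sketch of the kind of read path I mean. It assumes the LMDB value is handed to the application as a direct ByteBuffer; the class name `DecodeCostSketch` and the helper `decodeUtf8` are made up for illustration and are not the actual application code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DecodeCostSketch {

    // Hypothetical helper: decodes the UTF-8 bytes remaining in 'value' into a new String.
    static String decodeUtf8(ByteBuffer value) {
        // Copy 1: off-heap (memory-mapped) bytes -> temporary heap byte[].
        // duplicate() keeps the caller's position and limit untouched.
        byte[] tmp = new byte[value.remaining()];
        value.duplicate().get(tmp);
        // Copy 2: the String constructor decodes into its own newly allocated backing array.
        return new String(tmp, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for an LMDB value (assumption: a direct buffer holding UTF-8 text).
        byte[] utf8 = "bar\u2018".getBytes(StandardCharsets.UTF_8);
        ByteBuffer value = ByteBuffer.allocateDirect(utf8.length);
        value.put(utf8);
        value.flip();

        System.out.println(decodeUtf8(value).length()); // prints 4
    }
}
```

Every call performs two copies and at least two heap allocations, which is the overhead I am trying to remove.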
I want to refactor this application so that it does not allocate a new String every time it reads a message from the database. Ideally, the underlying char array in the String object would simply point to the memory-mapped location.
package testreflect;

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class App {
    public static void main(String[] args) throws Exception {
        // Obtain the Unsafe singleton via reflection
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe UNSAFE = (Unsafe) field.get(null);

        char[] sourceChars = new char[] { 'b', 'a', 'r', 0x2018 };

        // Encoding to a byte array; asBytes would be an LMDB entry.
        // Note: this stores raw UTF-16 code units in native byte order, not UTF-8.
        byte[] asBytes = new byte[sourceChars.length * 2];
        UNSAFE.copyMemory(sourceChars,
                UNSAFE.arrayBaseOffset(sourceChars.getClass()),
                asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                sourceChars.length * (long) UNSAFE.arrayIndexScale(sourceChars.getClass()));

        // Copying the byte array to the char array works, but is there a way to
        // have the char array simply point to the byte array without copying?
        char[] test = new char[sourceChars.length];
        UNSAFE.copyMemory(asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                test,
                UNSAFE.arrayBaseOffset(test.getClass()),
                asBytes.length * (long) UNSAFE.arrayIndexScale(asBytes.getClass()));

        // Allocate a String object, but set its underlying
        // char array manually to avoid the extra memory copy.
        // This relies on String.value being a char[] (Java 8 and earlier).
        long stringOffset = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        String stringTest = (String) UNSAFE.allocateInstance(String.class);
        UNSAFE.putObject(stringTest, stringOffset, test);

        System.out.println(stringTest);
    }
}
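So far, I've figured out how to copy a byte array into a char array and install it as the underlying array of a String object using sun.misc.Unsafe. Note that the example stores the characters as raw UTF-16 code units rather than UTF-8, so reading them back is a plain memory copy with no decoding step. This should reduce the amount of CPU time the application wastes decoding UTF-8 bytes.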
However, this does not solve the memory problem, because the char array is still a new heap allocation for every message. Is there a way to have a char array point to a memory location and avoid the allocation altogether? Avoiding the copy would reduce the number of unnecessary allocations the JVM makes for these strings, leaving more room for the OS to cache entries from the LMDB database.
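One direction I have considered, as a rough sketch rather than a solution: a Java array is always a heap object with its own header, so as far as I know a char[] cannot alias a mapped address. A CharBuffer view over the mapped ByteBuffer can, though, and CharBuffer implements CharSequence. The sketch below assumes the values are stored as UTF-16 in native byte order (as in my example above, not as UTF-8) and that the consuming code could accept a CharSequence instead of a String. The class name `CharViewSketch` and the helper `viewAsChars` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class CharViewSketch {

    // Hypothetical helper: exposes the UTF-16 bytes remaining in 'value' as a
    // CharSequence that reads straight from the buffer's memory (no char[] copy).
    static CharSequence viewAsChars(ByteBuffer value) {
        // duplicate() shares the memory but resets the byte order to BIG_ENDIAN,
        // so the order must be set again before creating the char view.
        return value.duplicate().order(ByteOrder.nativeOrder()).asCharBuffer();
    }

    public static void main(String[] args) {
        char[] sourceChars = { 'b', 'a', 'r', 0x2018 };

        // Stand-in for an LMDB value (assumption: UTF-16 code units, native order).
        ByteBuffer value = ByteBuffer.allocateDirect(sourceChars.length * 2)
                .order(ByteOrder.nativeOrder());
        for (char c : sourceChars) {
            value.putChar(c);
        }
        value.flip();

        CharSequence text = viewAsChars(value);
        System.out.println(text.length());            // prints 4
        System.out.println(text.charAt(3) == 0x2018); // prints true

        // Writing through the original buffer is visible through the view,
        // which shows that no copy of the character data was made.
        value.putChar(0, 'c');
        System.out.println(text.charAt(0));           // prints c
    }
}
```

This still allocates a small view object per value, but not a copy of the character data. The view is only valid while the underlying memory is (for LMDB, presumably while the read transaction is open), and calling toString() on it would bring the copy back.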