
I have been developing a C++ port of an existing Java project. I have the following C++ and Java code reading from the same test file, which consists of millions of integers.

C++:

    int * arr = new int[len]; //len is larger than the largest int from the data
    fill_n(arr, len, -1);  //fill with -1
    long loadFromIndex = 0;
    struct stat sizeResults;
    long size;
    if (stat(fileSrc, &sizeResults) == 0) {
        size = sizeResults.st_size; //here size would be ~551950000 for 552M test file
    }
    mmapFile = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, pageNum*pageSize);
    long offset = loadFromIndex % pageSize;
    while (offset < size) {
        int i = htonl(*((int *)(mmapFile + offset)));
        offset += sizeof(int);
        int j = htonl(*((int *)(mmapFile + offset)));
        offset += sizeof(int);
        swapElem(i, j, arr);
    }
    return arr;

Java:

    IntBuffer bb = srcFile.getChannel()
                    .map(MapMode.READ_ONLY, loadFromIndex, size)
                    .asIntBuffer().asReadOnlyBuffer();
    while (bb.hasRemaining()) {
        int i = bb.get();
        int j = bb.get();
        swapElem(i, j, arr); //arr is an int[] of the same size as the arr in C++ version, filled with -1
    }
    return arr;

The swapElem function is identical in the C++ and Java versions. It compares and modifies values in the array, but the original code is rather long to post here. For testing purposes, I replaced it with the following stub so the loop won't be dead code:

    void swapElem(int i, int j, int * arr) {   // int[] arr in Java
        arr[i] = j;
    }

I assumed the C++ version would outperform the Java version, but the test gives the opposite result -- the Java code is almost twice as fast as the C++ code. Is there any way to improve the C++ code?

My guess is that mmapFile + offset in the C++ version is recomputed too many times: O(n) additions for the pointer arithmetic plus O(n) additions for offset += sizeof(int), where n is the number of integers read. Java's IntBuffer.get() reads directly at the buffer's current index, so apart from the O(n) increments of that index, no extra addition is needed. Counting the index increments, C++ performs roughly 2n additions to Java's n, which might cause a significant difference over millions of integers.

Following this idea, I modified the C++ code as follows:

    mmapBin = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, pageNum*pageSize);
    int len = size - loadFromIndex % pageSize;
    char * offset = loadFromIndex % pageSize + mmapBin;
    int index = 0;
    while (index < len) {
        int i = htonl(*((int *)(offset)));
        offset += sizeof(int);
        int j = htonl(*((int *)(offset)));
        offset += sizeof(int);
        index+=2*sizeof(int);
    }

I expected a slight performance gain, but there wasn't one.

Can anyone explain why the C++ code runs slower than the Java code? Thanks.

Update:

I have to apologize: when I said -O2 did not work, the problem was at my end. I had messed up the Makefile, so the C++ code was not being recompiled with -O2. I've updated the numbers, and the C++ version built with -O2 now outperforms the Java version. This can seal the question, but if anyone would like to share how to improve the C++ code further, I will follow. Generally I would expect it to be about twice as fast as the Java code, but currently it is not. Thank you all for your input.

Compiler: g++

Flags: -Wall -c -O2

Java Version: 1.8.0_05

Size of File: 552MB, all 4 byte integers

Processor: 2.53 GHz Intel Core 2 Duo

Memory 4GB 1067 MHz DDR3

Updated Benchmark:

    Version                         Time (ms)
    C++                             ~1100
    Java                            ~1400
    C++ (without the while loop)    ~35
    Java (without the while loop)   ~40

Some setup code before these snippets accounts for the ~35 ms baseline (mostly filling the array with -1), but that is not important here.

Fenwick
  • That is a very curious result. Can you try strace on the C++ program to see what the JVM is doing "under the hood" to access the file? Also, did you try compiling the C++ code with optimization (such as -O2)? – ash Oct 28 '14 at 03:46
  • Hmm, the C++ code uses htonl() on the values but the Java code does not, right? Try taking that out and see the difference. – ash Oct 28 '14 at 03:47
  • I would try looking at the JVM source code. Also you could try using the MAP_PRIVATE | MAP_POPULATE flags. – Maxaon3000 Oct 28 '14 at 03:47
  • Thanks for the answers. I updated some performance test and the result is still very curious. – Fenwick Oct 28 '14 at 06:28
  • @Maxaon3000 I tried MAP_POPULATE, but it is not defined on Mac. It would be best to avoid using it. As for JVM, I've never messed with the source code, and really don't know where to begin with. If you mean the Java source code, I can see Java holds the IntBuffer in an array. For `.get()` it just reads from current array index then increase the index by 1. – Fenwick Oct 28 '14 at 06:36
  • @ash I added some tests in the post, including taking off htonl(). The performance improves, but still shockingly slower than that of Java, and the code does not work as intended of course. For strace(), I'm on Mac so I can only use dtruss(). Do you mean I should dtruss the C++ program or the Java program? – Fenwick Oct 28 '14 at 06:40
  • And I've just added compiler and flags. – Fenwick Oct 28 '14 at 06:43
  • @Fenwick, can you please give more details how to repeat you experiment. That is: Size of the file, Java Version, C++ compiler used, compiler options, test machine processor. The easiest way is to look on the finally generated machine code, since this is a very tight routine. – cruftex Oct 28 '14 at 06:44
  • @cruftex Updated more details. Thanks! – Fenwick Oct 28 '14 at 06:58
  • Have you tried it with `MAP_PRIVATE` instead of `MAP_SHARED`? – Turix Oct 28 '14 at 07:31
  • Putting your C++ code (first snippet, with htonl) in a main and adding just enough stuff to make it compile (mapped to the beginning of a 100M file, size=100M), I get ~130ms with no optimizations, ~2ms with `-O2` (the loop is simply stripped out by the compiler since it doesn't do anything). So please post an actual repro that people can play with. – Mat Oct 28 '14 at 08:10
  • Here is the JVM code for the POSIX impl: https://github.com/awh/openjdk7/blob/master/jdk/src/solaris/native/sun/nio/ch/FileChannelImpl.c It seems to be doing the same thing you are. – Maxaon3000 Oct 28 '14 at 14:24
  • Also the performance here is not bound by what's in the loop. It's the number of page faults that the code generates. For some reason the Java code generates fewer page faults. – Maxaon3000 Oct 28 '14 at 14:34
  • Maybe the MacOSX JVM does something special. The next thing I would try is using the madvise call after you map the memory. – Maxaon3000 Oct 28 '14 at 14:41

1 Answer


I have some doubts that the benchmark method is correct. Both loops are dead code: you never actually use i and j, so the gcc compiler or the Java JIT may decide to remove the loop entirely once it sees that it has no effect on the rest of the program.

Anyway, I would change the C++ code to:

    mmapFile = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, pageNum*pageSize);
    long offset = loadFromIndex % pageSize;
    int i, j;
    int szInc = 2 * sizeof(int);
    while (offset < size) {
        sscanf(mmapFile + offset, "%d", &i);
        sscanf(mmapFile + offset + sizeof(int), "%d", &j);
        offset += szInc; // offset += 8;
    }

This would be the equivalent of the Java code. In addition, I would keep -O2 among the compilation flags. Keep in mind that htonl() is an extra conversion step; on the Java side the equivalent byte-order handling happens inside the IntBuffer, since a mapped ByteBuffer defaults to big-endian.
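To address the follow-up about squeezing more out of the C++ version while keeping the binary reads: one option, sketched below under the assumption (as in the question) that the mapping begins on an int boundary, is to walk the mapping through a uint32_t pointer and hint the kernel about the access pattern, as suggested in the comments. The function name and parameters here are made up for illustration:

```cpp
#include <cstdint>
#include <arpa/inet.h>
#include <sys/mman.h>

static void swapElem(int i, int j, int *arr) { arr[i] = j; }  // stub from the question

// base: the mmap'd region, size: its length in bytes, arr: the target array.
void processMapping(char *base, size_t size, int *arr) {
    // One sequential front-to-back pass, so ask the kernel to read ahead
    // aggressively (harmless if the hint is ignored).
    madvise(base, size, MADV_SEQUENTIAL);

    const uint32_t *p = (const uint32_t *)base;
    const uint32_t *end = p + size / sizeof(uint32_t);
    while (p + 1 < end) {              // stop when no full pair is left
        int i = (int)ntohl(p[0]);      // ntohl: read-direction twin of htonl
        int j = (int)ntohl(p[1]);
        p += 2;
        swapElem(i, j, arr);
    }
}
```

With -O2 the compiler usually turns this into much the same code as the explicit offset arithmetic, so the main win, if any, comes from the madvise hint rather than from saving additions.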

VAndrei
  • I tried scanf(), but the program gets slower. You are right that the g++ compiler skipped the loop because it is dead. I've added a couple of things to make the loop no longer dead code. – Fenwick Oct 28 '14 at 18:49