
I'm running a Java program on an ARMv6 processor. The nature of this program requires me to convert some numbers (int or float) to String. The processor runs at 850 MHz. The Java runtime is OpenJDK Zero VM 1.7.0_21-b02.

I'm not expecting stellar performance here, but I would expect something much more efficient than what I am seeing with the code snippet below.

    long time1, time2;

    float[] src = new float[2000000];
    for (int i = 0; i < src.length; i++) {
        src[i] = (float)Math.random()* 2.56454512f * (float) Math.random();
    }
    time1 = System.nanoTime();
    for (int j = 0; j < src.length; j++) {
        String test = String.valueOf(src[j]);
    }
    time2 = System.nanoTime();
    logTimeDelay("String.valueOf", time1, time2);

    time1 = System.nanoTime();
    for (int j = 0; j < src.length; j++) {
        String test = Float.toString(src[j]);
    }
    time2 = System.nanoTime();
    logTimeDelay("Float.toString", time1, time2);


    StringBuilder sb = new StringBuilder(50);
    time1 = System.nanoTime();
    for (int j = 0; j < src.length; j++) {
        sb.setLength(0);
        sb.append(src[j]);
    }
    time2 = System.nanoTime();
    logTimeDelay("StringBuilder.append, setLength", time1, time2);

    time1 = System.nanoTime();
    for (int j = 0; j < src.length; j++) {
        String test = "" + src[j];
    }
    time2 = System.nanoTime();
    logTimeDelay("\"\" + ", time1, time2);

    private static void logTimeDelay(String message, long time1, long time2){
        System.out.println(String.format(message + ": %.5f s", (float) (time2 - time1) / 1.0e9));
    }

Running this code snippet on my i7 computer returns the following results:

    String.valueOf: 0.39714 s
    Float.toString: 0.33295 s
    StringBuilder.append, setLength: 0.33277 s
    "" + : 0.37581 s

Running the exact same code snippet on the ARMv6 processor returns the following values:

    String.valueOf: 204.78758 s
    Float.toString: 200.79659 s
    StringBuilder.append, setLength: 180.81551 s
    "" + : 267.63036 s

Any clues on how I could optimize my number-to-String conversion on this device?

Thanks in advance.

crazydread18

1 Answer


"Out of thin air" hypothesis, but the difference in performance you observe here seems to be related to CPU caching; your ARM CPU has far less cache than your desktop's i7.

Your float array has two million elements in it; at 4 bytes per float, that makes for a minimum of 8 MB of storage. Those 8 MB need to reach the CPU.

I also have an i7 here, and its cache sizes are 32 KB (L1), 256 KB (L2) and 6 MB (L3); three quarters of the float array can fit into L3! It seems that in your case only 32 KB can be cached at a time... Therefore there is a lot of cache thrashing and the memory bus traffic is very high.

I suspect that if you reduce your array size to something which fits in 32 KB (for instance, try with only 1000 floats), the performance figures will be far closer.

EDIT: it also happens that your CPU does not have an FPU; that accounts for the majority of the performance loss, as @Voo mentioned.

So:

  • lack of an FPU,
  • small cache,
  • lots of data.

For a more "realistic" comparison, you should test over a smaller subset of data; this will at least alleviate (but not completely eliminate) the cache problem.
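A minimal sketch of such a test, assuming 1000 floats (about 4 KB) converted 2000 times so the total number of conversions stays at two million. The class name, the `convertAll` helper, the warm-up count and the checksum are illustrative choices, not part of the original snippet. The checksum exists only so that the converted strings are actually used, and the warm-up passes only matter on a runtime with a JIT.

```java
final class FloatToStringBench {
    // Hypothetical sizes: 1000 floats repeated 2000 times = 2,000,000 conversions.
    static final int SIZE = 1000;
    static final int PASSES = 2000;
    static final int WARMUP_PASSES = 200; // arbitrary; only useful when a JIT is present

    // Converts every element once and returns the total string length,
    // so the result of each conversion is consumed rather than discarded.
    static long convertAll(float[] src) {
        long checksum = 0;
        for (int i = 0; i < src.length; i++) {
            checksum += Float.toString(src[i]).length();
        }
        return checksum;
    }

    public static void main(String[] args) {
        float[] src = new float[SIZE];
        for (int i = 0; i < src.length; i++) {
            src[i] = (float) Math.random() * 2.56454512f * (float) Math.random();
        }

        // Untimed warm-up over the same small array.
        long sink = 0;
        for (int p = 0; p < WARMUP_PASSES; p++) {
            sink += convertAll(src);
        }

        // Timed run: the working set is small enough to stay cache-resident.
        long start = System.nanoTime();
        for (int p = 0; p < PASSES; p++) {
            sink += convertAll(src);
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("Float.toString: %.5f s (checksum %d)%n", elapsed / 1.0e9, sink);
    }
}
```

If the ARM figures stay in the same range with this layout, the cache is not the dominant factor and the conversion itself is the cost.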

fge
  • Rather unlikely, considering the low amount of memory access to begin with. Far more likely is the fact that not all ARMv6 cores have a hardware floating-point coprocessor, which is rather bad considering how division-heavy string conversions are. Also, the benchmark is (as usually) completely broken to begin with, and different JVMs handle that differently well (OSR active?) – Voo Feb 28 '14 at 16:00
  • Thanks for your input on this. I understand that there is a significant difference in the caching and processing power of my ARM processor and my i7. This goes without saying. However, I'm puzzled by how good the performance is in various kinds of data processing, while number-to-string conversion performance is abysmal, especially with float and double. Apparently, integers are not so bad. Doing the same test with an int[] of 200000 ints produces: String.valueOf: 27.49648 s; Integer.toString: 25.84437 s; StringBuilder.append, setLength: 11.58025 s; "" + : 96.30360 s – crazydread18 Feb 28 '14 at 16:00
  • @Voo maybe so, but it still _is_ a factor in my opinion; I wouldn't consider accessing 8MB a "low cost" operation when your cache is only 32 KiB – fge Feb 28 '14 at 16:02
  • @crazy first read up on how to correctly benchmark code in Java; it will save you many surprises in the future. Then check whether your CPU has a float unit or not. – Voo Feb 28 '14 at 16:03
  • @crazydread18 then Voo probably has a point about a missing FPU here – fge Feb 28 '14 at 16:03
  • @fge for each single benchmark you only read each number once, so the only profit you'll get is from the parts of the source array that weren't evicted in the previous round. Then this is perfectly streamable memory access, which will further reduce the effects (assuming the JIT kicks in and does the right thing). Does it influence the results? Sure, but hardly to such a noticeable degree – Voo Feb 28 '14 at 16:07
  • @Voo I fail to see your point -- as far as the CPU is concerned there is no "streamable memory", there is itself, its cache and the memory bus; and reading from the memory bus takes time – fge Feb 28 '14 at 16:09
  • Right. This processor does not have a hardware float co processor. And @Voo, thanks for your input, but why bash on me (with the 'as usually', etc) rather than teach. I do not claim to know everything, otherwise I wouldn't ask questions here. Any suggestion on how I could improve the actual results? I'm thinking about multiplying the floats by a few orders of magnitude, then cast to int before converting to string. – crazydread18 Feb 28 '14 at 16:10
  • @fge I do not think the memory access time is relevant here. Floats and ints are both 32 bits. So streaming an int or a float to the CPU should have the same overhead, right? – crazydread18 Feb 28 '14 at 16:12
  • @crazydread18 yes, but as your integer test demonstrates, you _still_ have very "bad" results compared to your i7, right? – fge Feb 28 '14 at 16:16
  • Well. The i7 is much less powerful than the ARM processor, wouldn't that be enough to produce such results? – crazydread18 Feb 28 '14 at 16:20
  • @crazydread18 uh, can you rephrase? I kind of heard you say that the i7 is less powerful than the ARM? I'd have it the other way around... – fge Feb 28 '14 at 16:21
  • @fge haha you are right. I wrote faster than my thoughts. The ARM is much slower than the i7. – crazydread18 Feb 28 '14 at 16:22
  • @crazydread18 well that is certainly a factor, yes, but as I said, if you _really_ want to compare performance, you should do so with datasets which fit as much as possible into the cache; even for an ARM, having to fetch data from RAM is _slow_. – fge Feb 28 '14 at 16:24
  • @crazydread18 for instance, instead of doing 2000000 * 1, do 1000 * 2000 -- operate on the same set of 1000 floats, 2000 times; it will easily fit into both CPUs' caches, and this will no longer be a concern – fge Feb 28 '14 at 16:28
  • @fge I agree with everything you say here, but I think this is drifting a bit off topic. What I'm really interested in here is not comparing the performance of the two systems, but simply finding the fastest possible way to convert floating-point numbers to their string representation in Java. This simple conversion is creating a MASSIVE bottleneck in my application, which is otherwise running smoothly. – crazydread18 Feb 28 '14 at 16:29
  • @crazydread18 I don't really see a solution for that, except maybe if you operate with a fixed set of decimals, do calculations with integer values? – fge Feb 28 '14 at 16:31
  • @crazydread It's not bashing, it's just a declaration that the benchmark is broken which can certainly cause problems when using it for comparing numbers (e.g. the ARM JIT may not be able to do OSR which would explain at least one order of magnitude wrt performance already). The only reliable way to improve performance on ARM cores that don't have a float co-processor is to not use floats but fixed point math. If that's not possible you may check what kind of emulator the runtime uses and see if you can find optimized routines to use instead. – Voo Mar 01 '14 at 13:57
  • @Voo Thanks! This is a much, much better answer than the last one! ;) We had to write a couple of floats to a CSV file (float-to-String conversion requires a lot of float operations...), at about 100 lines per second. This was a bit too much for this processor to handle. Writing strings, or even integers converted to strings, at the same rate was OK though. We switched everything to a binary file format, and we now do the binary-to-CSV conversion on a remote machine. – crazydread18 Mar 01 '14 at 15:19
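
The fixed-point idea raised in the comments (scale the float, cast to int, then format using integer operations only) could look roughly like the sketch below. It is only an illustration under stated assumptions: the class and method names are hypothetical, three decimals are assumed to be enough, the scaled value is assumed to fit in an `int`, and NaN or infinite inputs are not handled. It still needs a few float operations per value (a multiply, a compare and an add), so whether it helps on a core without an FPU would have to be measured.

```java
final class FixedPointFormat {
    // Hypothetical helper: appends 'value' with exactly three decimals.
    // Assumes |value| * 1000 fits in an int; NaN and infinities are not handled.
    static void appendFixed3(StringBuilder sb, float value) {
        // Scale and round half away from zero; this is the only floating-point work.
        int scaled = (int) (value * 1000f + (value < 0 ? -0.5f : 0.5f));
        if (scaled < 0) {
            sb.append('-');
            scaled = -scaled;
        }
        int whole = scaled / 1000;
        int frac = scaled % 1000;
        sb.append(whole).append('.');
        // Left-pad the fractional part to three digits.
        if (frac < 100) sb.append('0');
        if (frac < 10) sb.append('0');
        sb.append(frac);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(16);
        appendFixed3(sb, 1.5f);
        System.out.println(sb); // prints 1.500
        sb.setLength(0);
        appendFixed3(sb, -0.25f);
        System.out.println(sb); // prints -0.250
    }
}
```

Unlike `Float.toString`, this does not produce the shortest round-trippable representation; it trades precision and range for a fixed, cheap format.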