17

I'm writing fairly low-level code that must be highly optimized for speed. Every CPU cycle counts. Since the code is in Java I can't go as low-level as I could in C, for example, but I want to get everything out of the VM that I can.

I'm processing an array of bytes. There are two parts of my code that I'm primarily interested in at the moment. The first one is:

int key =  (data[i]     & 0xff)
        | ((data[i + 1] & 0xff) <<  8)
        | ((data[i + 2] & 0xff) << 16)
        | ((data[i + 3] & 0xff) << 24);

and the second one is:

key = (key << 15) | (key >>> 17);

Judging from the performance, I'm guessing that these statements aren't being optimized the way I expect. The second statement is basically a ROTL 15, key. The first statement loads 4 bytes into an int. The 0xff masks are there only to strip the sign-extension bits that result from the implicit widening of byte to int when the accessed byte happens to be negative. This should be easy to translate into efficient machine code, but to my surprise performance goes up if I remove the masks. (Which of course breaks my code, but I was interested to see what happens.)
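For reference, the standard JDK way to express the same two operations would be something like the sketch below (Integer.rotateLeft is, as far as I understand, the idiom the JIT is supposed to recognize as a rotate; I haven't verified whether this version actually produces better machine code than mine):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Wrap the array once (outside the hot loop) as a little-endian buffer.
ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

// Inside the loop: read 4 bytes at offset i, then rotate left by 15.
int key = buf.getInt(i);               // same as the manual 4-byte assembly above
key = Integer.rotateLeft(key, 15);     // same as (key << 15) | (key >>> 17)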

What's going on here? Do the most common Java VMs optimize this code during JIT in the way one would expect a good C++ compiler to optimize the equivalent C++ code? Can I influence this process? Setting -XX:+AggressiveOpts seems to make no difference.

(CPU: x64, Platform: Linux/HotSpot)

Rinke
  • 6,095
  • 4
  • 38
  • 55
  • 6
    I don't understand how you get to the point where you **have** to worry about performance, and you **know** your performance is not already good enough, and you've already tried the broad algorithmic stuff and have determined that you need to micro-optimize, **and** you **have** to do it in a specific language that's not super-low-level... yet you don't know how to determine for yourself what the language is doing behind the scenes, and haven't already done a fair amount of research on the topic as a simple matter of personal interest. Optimization of this sort is a pretty specialized skill. – Karl Knechtel Nov 24 '11 at 11:31
  • 10
    Wow. I didn't know my Q would offend anyone. You don't seem to assume good faith. I'm putting much effort into this problem, but maybe my efforts would seem pathetic to you. In any case I'd be happy to learn from you. To answer your Qs: the language is a given from the product owner. The performance is critical by the nature of the prod. I don't want to use JNI in order not to break the platform indep. The code I'm talking about is deep in the core and where the CPU spends most of its time. That merits my question if you ask me. But even if not: wouldn't it be interesting in its own right? – Rinke Nov 24 '11 at 12:27
  • 1
    Oh, and I get 15 times the speed in C++ on the same machine (in terms of data-throughput). That's at least a clue that performance isn't good enough, isn't it? – Rinke Nov 24 '11 at 12:33
  • 1
    @Rinke Are you sure these lines are the reason why the Java version is so much slower than the C++ version? – jmg Nov 24 '11 at 15:10
  • @jmg No, not really. The main loop is not big: around 10 instructions. These two have my special interest because I noticed a performance boost when I changed them, which made me wonder if they are nicely translated into machine code. Would you expect them to be? (E.g. the translation into ROTL.) – Rinke Nov 24 '11 at 15:20
  • 1
    @Rinke Cliff Click, one of the original HotSpot developers, always claims in his blog that the second-tier HotSpot compiler generates code as good as a C compiler for code which is basically C. – jmg Nov 24 '11 at 15:25
  • 2
    If you get 15 times the speedup in C++, chances are extremely good that you're just measuring wrong, so post your **microbenchmarks** for the problem. @jmg HotSpot -c2 generates code about equal to a C compiler for this stuff, sure, but last time I checked it didn't do much SSE - which, depending on the larger code at hand, could certainly give noticeable speedups for the C++ program. Again, not much to say without more code. – Voo Nov 24 '11 at 16:00
  • 1
    I assume good faith on your part, not necessarily on your employer's or client's part :) I'm not offended, either... just find the situation somewhat confusing. Optimization of this sort is hard; even just getting meaningful measurements is hard. These tasks should be assigned to people who already at least know how to do the research, and ideally already know a lot about the inner workings of (in this case) the JVM (or equivalent knowledge for related situations). That said, I'll flail and offer something: have you tried ensuring the "server" VM is used? – Karl Knechtel Nov 24 '11 at 21:56
  • Thanks guys. I've made headway, but I'm not quite there yet. The answer by Puce about benchmarking was helpful, and it turned out that the ridiculous factor of 15 was largely due to measurement issues. I'm still slower than the C++ version though. Next I'll try to get into the bytecode (must learn this). – Rinke Dec 05 '11 at 12:55

4 Answers

7

How do you test the performance?

Here are some good articles:

http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html

http://www.ibm.com/developerworks/java/library/j-benchmark2/index.html

http://ellipticgroup.com/html/benchmarkingArticle.html
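For a very rough starting point (a minimal sketch of my own, not taken from the articles; process stands in for whatever code you are measuring), warm up the JIT first and keep a side effect alive so the work can't be optimized away:

static long checksum;   // consume the result so the JIT cannot eliminate the work under test

static void benchmark(byte[] data) {
    for (int i = 0; i < 20000; i++) {        // warm-up: let HotSpot compile the hot path
        checksum += process(data);
    }
    final int runs = 100000;
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
        checksum += process(data);
    }
    long elapsed = System.nanoTime() - start;
    System.out.println("avg ns per run: " + (double) elapsed / runs);
}

The articles above cover the remaining pitfalls (on-stack replacement, dead-code elimination, GC pauses) in much more detail.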

Puce
  • 37,247
  • 13
  • 80
  • 152
  • Thanks a lot. Indeed, my measurements turned out to be flawed. Fixed this, but I'm still not as fast as the C++ version. – Rinke Dec 05 '11 at 12:57
5

I've done a lot of performance work in Java, and I've even coded directly in bytecode, enough to be sure of a couple of things: the JIT is a black box with obscure behaviours, the JIT and compilers are incredibly efficient, and the simplest code usually yields the best performance.

This is normal when you think about the GOAL of the JIT: to extract the best possible performance from any Java code. Given that Java is quite a simple and plain language, simple code will be optimized well, and further tricks will generally do no good.

Of course, there are some common pitfalls and gotchas you ought to know, but I see none in your code samples. Were I to optimize your code, I would go straight to the higher level: algorithm. What is the complexity of your code? Can some data be cached? What APIs are used? Etc... There's a seemingly endless pit of performance to be extracted from algorithmic tricks alone.

And if even this is not sufficient, if the language is not fast enough, if your machine is not fast enough, if your algorithm cannot be made any faster, the answer won't lie in "clock cycles", because you might squeeze out 20% more efficiency, but 20% will never be enough once your data grows. To be sure you never hit a performance wall again, the ultimate answer lies in scalability: make your algorithm and your data endlessly distributable, so you can just throw more workers at the task.

solendil
  • 8,432
  • 4
  • 28
  • 29
  • Thanks. I'm sorry to ignore the more interesting points of your answer (i.e. in this comment), but can I conclude from your experience that (1) the masks in the first statement shouldn't be explicit in the compiled machine code and (2) the second statement should be translated into a single ROTL (or ROTR)? – Rinke Nov 24 '11 at 15:45
3

I do agree with solendil, but if you want to dig deeper at the low level, try looking at the code produced by the JIT as described here.
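In case the link goes stale: on HotSpot the disassembly is enabled with the diagnostic flags below (the hsdis disassembler plugin has to be on the JVM's library path for PrintAssembly to produce output; MyBenchmark is just a placeholder class name):

# Dump the machine code the JIT produces (requires the hsdis plugin)
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyBenchmark

# Without hsdis, -XX:+PrintCompilation at least shows which methods get compiled
java -XX:+PrintCompilation MyBenchmark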

jmg
  • 7,308
  • 1
  • 18
  • 22
  • That link is now broken, here's the [new one](https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly) – skuro Jul 10 '13 at 07:55
0

You don't need to (& 0xff) before shifting 24 bits to the left.
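To make the point concrete (reusing the variables from the question): the sign-extension bits occupy bits 8 through 31, and a left shift by 24 pushes all of them past bit 31, so the last line can drop its mask:

int key =  (data[i]     & 0xff)
        | ((data[i + 1] & 0xff) <<  8)
        | ((data[i + 2] & 0xff) << 16)
        |  (data[i + 3]         << 24);   // mask dropped: the sign-extension bits are shifted out anyway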

Ingo
  • 36,037
  • 5
  • 53
  • 100