
I have two functions, each of which calculates the cosine similarity of two different vectors. One is written in Java, and one in C.

In both cases I am declaring two 200-element arrays inline, and then calculating their cosine similarity 1 million times. I'm not counting the time for the JVM to start up. The Java implementation is nearly 15 times slower than the C implementation.

My questions are:

1.) Is it reasonable to assume that for tight loops of simple math, C is still an order of magnitude faster than Java?

2.) Is there some mistake in the Java code, or some sane optimization that would dramatically speed it up?

Thanks.

C:

#include <math.h>

int main()
{
  int j;
  for (j = 0; j < 1000000; j++) {
    calc();
  }

  return 0;

}

int calc ()
{

  double a [200] = {0.269852, -0.720015, 0.942508, ...};
  double b [200] = {-1.566838, 0.813305, 1.780039, ...};

  double p = 0.0;
  double na = 0.0;
  double nb = 0.0;
  double ret = 0.0;

  int i;
  for (i = 0; i < 200; i++) {
    p += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }

  return p / (sqrt(na) * sqrt(nb));

}

$ time ./cosine-similarity

0m2.952s

Java:

public class CosineSimilarity {

    public static void main(String[] args) {

        long startTime = System.nanoTime();

        for (int i = 0; i < 1000000; i++) {
            calc();
        }

        long endTime = System.nanoTime();
        long duration = (endTime - startTime);

        System.out.format("took %d seconds%n", duration / 1000000000);

    }

    public static double calc() {

        double[] vectorA = new double[] {0.269852, -0.720015, 0.942508, ...};
        double[] vectorB = new double[] {-1.566838, 0.813305, 1.780039, ...};

        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += Math.pow(vectorA[i], 2);
            normB += Math.pow(vectorB[i], 2);
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

$ java -cp . -server -Xms2G -Xmx2G CosineSimilarity

took 44 seconds

Edit:

Math.pow was indeed the culprit. Removing it brought the performance right on par with that of C.
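
Concretely, the fix was just the loop body; a minimal sketch with the same variable names as the code above:

for (int i = 0; i < vectorA.length; i++) {
    dotProduct += vectorA[i] * vectorB[i];
    normA += vectorA[i] * vectorA[i]; // was Math.pow(vectorA[i], 2)
    normB += vectorB[i] * vectorB[i]; // was Math.pow(vectorB[i], 2)
}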

Scott Klarenbach
  • Not that it should affect the time, but you appear not to be using `ret` in the C example. – Kevin Jan 01 '15 at 22:49
  • I wonder whether `Math.pow()` is responsible. Try `vectorA[i] * vectorA[i]`. – Kevin Jan 01 '15 at 22:50
  • Also, you could return `dotProduct / Math.sqrt(normA * normB)` - but what you're returning DOES match the C version. – Dawood ibn Kareem Jan 01 '15 at 22:52
  • Mind that in C you are using automatically allocated variables (on the stack), while in Java you are forced to use the heap (`new double[] ..`), so that's not equivalent. – Jack Jan 01 '15 at 22:55
  • @Jack as Java doesn't have stack allocation of arrays, I'd argue that any advantage that gives C should count. – Kevin Jan 01 '15 at 22:57
  • @Kevin: in that specific circumstance that's not true. In C you can take advantage of automatic allocation just because the arrays are not changing, which won't be true in a real-world scenario. So you should at least pass the arrays in Java by allocating them just once outside. – Jack Jan 01 '15 at 22:59
  • Note: C code used `int calc ()` rather than the equivalent `double calc ()` (Not a major time factor) – chux - Reinstate Monica Jan 01 '15 at 23:02
  • Try the same test with the C code doing `pow(a[i],2);` and `pow(b[i],2);`. Note: a highly optimizing C compiler may still use `a[i]*a[i]`. – chux - Reinstate Monica Jan 01 '15 at 23:08
  • @chux: note that some compilers (e.g. GCC) will optimise pow(x,2) to x*x. – Oliver Charlesworth Jan 01 '15 at 23:10
  • @Scott This post is now confusing. The posted times do not correspond to the code. Suggest re-verting code to initial posting and leaving your "Edit" comments in – chux - Reinstate Monica Jan 01 '15 at 23:41

4 Answers


Math.pow(a, b) does roughly Math.exp(Math.log(a) * b); it's a very expensive way to square a number.

I suggest you write the Java code similar to the way you wrote the C code to get a closer result.

Note: the JVM can take a couple of seconds to warm up the code. I would run the test for longer.
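
For example, a minimal warm-up pass before the timed loop might look like this (a rough sketch, not a rigorous benchmark; the iteration counts are arbitrary):

// Run the workload untimed first, so the JIT has a chance to
// compile calc() before measurement starts.
for (int i = 0; i < 100000; i++) {
    calc();
}

long startTime = System.nanoTime();
for (int i = 0; i < 1000000; i++) {
    calc();
}
System.out.format("took %d ms%n", (System.nanoTime() - startTime) / 1000000);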

Peter Lawrey
  • Yes, it's always baffled me why people write `Math.pow(something,2)` instead of `something * something`. I would have thought this was fairly basic knowledge. – Dawood ibn Kareem Jan 01 '15 at 23:00
  • Surely `Math.pow(a, b)` is not specified in such an inaccurate way as `math.exp( math.log (a)*b)`. It **is** expensive, but not because it is computed in this fashion. – Pascal Cuoq Jan 01 '15 at 23:03
  • @PascalCuoq this is a simplification. Actually pow is more expensive than exp + log, so you are right that it does more than that. – Peter Lawrey Jan 01 '15 at 23:14
  • @David Wallace It is such basic knowledge that good optimizing compilers use the faster method regardless on how it is coded. – chux - Reinstate Monica Jan 01 '15 at 23:15
  • It made a difference to me. I padded out the `...` in the Java program with some numbers to make the vectors up to length 200. I can confirm that the version with `Math.pow` consistently took between 5.8 and 5.9 seconds to run on my laptop, whereas the version with `*` consistently took about 0.9 seconds on the same laptop. So regardless of what `Math.pow` is doing, it's certainly 6-7 times slower. – Dawood ibn Kareem Jan 01 '15 at 23:20
  • @chux In the light of my experiment, are you saying that `javac` (as shipped in 1.7.0_60-b19) is not a good optimizing compiler? – Dawood ibn Kareem Jan 01 '15 at 23:24
  • @OliverCharlesworth The phrase “for the Math class, a larger error bound of 1 or 2 ulps is allowed for certain methods” (http://docs.oracle.com/javase/7/docs/api/java/lang/Math.html ) precludes that naïve implementation. The naïve composition of double-precision `exp` and `log` makes a `pow` that is (in)accurate to more than 500 ULP in the worst cases. The netlib implementation computes the logarithm to higher precision, it is certainly not `math.exp( math.log (a)*b)`. – Pascal Cuoq Jan 01 '15 at 23:26
  • @Scott - I don't understand. You edited the question after this answer was posted, to make this answer incorrect. You then accepted this answer. Are you looking for further answers or not? – Dawood ibn Kareem Jan 01 '15 at 23:32
  • @David. This answer was correct. I initially edited the question to remove Math.pow, thinking it was an irrelevant detail, and was still seeking an answer to the original question. Now I realize Math.pow was entirely the problem, so I've edited the question back to contain the poor implementation so as not to confuse others that come back later. – Scott Klarenbach Jan 01 '15 at 23:34
  • It's interesting that the implementation makes no checks on the size of the exponent before passing it along to the expensive approximation. Surely if 0 <= exponent <= some small natural number, it'd be way quicker to just loop multiplying the number by itself. A simple check is cheap enough and the gains are likely pretty significant. – Michael Goldstein Jan 02 '15 at 02:35
  • @PascalCuoq While this is true, I had assumed it used 80-bit floating point, which would have been accurate enough when casting the result to 64-bit. – Peter Lawrey Jan 02 '15 at 07:48

I have seen factors of 2 in tight graphics loops. Never 15.

I'd be very suspicious of your test. In addition to the other excellent points already presented, consider that many C compilers (including e.g. gcc) are capable of deducing that the result of your computation is never used and, consequently, that arbitrary chunks up to and including the whole benchmark can be optimized away. You'll need to look at the generated code to determine if this is happening.
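
The same hazard exists on the Java side: the JIT can also drop work whose result is never consumed. One cheap guard, sketched here in Java (the equivalent idea works in the C version too), is to accumulate the results into a value the program actually uses:

// Consume every result so neither a C compiler nor the JIT can
// prove the computation dead and delete the whole loop.
double sink = 0.0;
for (int i = 0; i < 1000000; i++) {
    sink += calc();
}
System.out.println(sink); // makes the work observable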

Gene
  • yup the test was the problem. :) I was using Math.pow in the Java version but forgot to change it to a * a instead. Removing that detail brought the performance to the equivalent of C. – Scott Klarenbach Jan 01 '15 at 23:39
  • "You'll need to look at the generated code to determine if this is happening." or compile with `-O0` :) – Michael Goldstein Jan 02 '15 at 02:37

In addition to the comment about Math.pow(x,2) not being directly comparable to x*x, see other answers regarding benchmarking Java. TL;DR: doing it right isn't simple or easy.

Since the Java environment includes execution-time compilation (the JIT compiler), and may include execution-time dynamic optimization ("Hotspot" and similar technologies), getting valid Java performance numbers is complicated. You need to specify whether you're interested in early or steady-state performance, and if the latter you need to allow the JRE to warm up before you start measuring -- and even then the results may be significantly different for apparently-similar input sets.

To make matters worse, JIT compilation order is nondeterministic in some JREs; successive executions may choose to optimize the code in different orders. And for particularly large Java applications, you may find that the JRE has a limit on how much code it keeps in fully-JITted form, so that variation in compilation order can have surprisingly large performance effects. Even after full warm-up, and factoring out the effects of GC and other asynchronous operations, I have found that some releases of some JREs could show run-to-run performance variations of up to 20% for exactly the same code and input.

Java can perform surprisingly well, since the JIT compiler makes it function as a (late-)compiled language. But microbenchmarks are often going to be misleading, and even macrobenchmarks may have to be averaged over multiple loads (not just multiple executions) to get reliably meaningful numbers.
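
As a rough illustration (a hand-rolled sketch; a proper harness such as JMH handles warm-up and statistics for you), timing several back-to-back runs makes the run-to-run variation visible instead of hiding it in a single number:

// Warm up first, then report each measured run separately so
// run-to-run variation shows up in the output.
for (int i = 0; i < 100000; i++) {
    calc();
}
for (int run = 1; run <= 5; run++) {
    long t0 = System.nanoTime();
    for (int i = 0; i < 1000000; i++) {
        calc();
    }
    System.out.format("run %d: %d ms%n", run, (System.nanoTime() - t0) / 1000000);
}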

keshlam
  • It is a possible answer; benchmark numbers aren't meaningful unless you understand exactly how the benchmark was taken and that's more true in Java than in many languages. I can't say whether it is *the* answer without doing a lot more detailed work than I have time for. – keshlam Jan 01 '15 at 23:23

Using static arrays will speed things up, maybe not 15 times, but perhaps 10. And squaring is better done by multiplication. Using a local variable for vectorA[i] is more a matter of style, and might even make compiler optimization more difficult.

static final double[] vectorA = {0.269852, -0.720015, 0.942508, ... };
static final double[] vectorB = {-1.566838, 0.813305, 1.780039, ... };

public static double calc() {
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vectorA.length; i++) {
        double a = vectorA[i];
        double b = vectorB[i];
        dotProduct += a * b;
        normA += a * a;
        normB += b * b;
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Joop Eggen
  • Thanks. But in reality both the C code and the Java code would receive them as arguments every time. I just wanted to isolate the actual performance of the Math. – Scott Klarenbach Jan 01 '15 at 23:26
  • Already was almost certain of that; I just wanted to point out, by separating the calculation from the data preparation, that the math is not necessarily as costly as the data shovelling. – Joop Eggen Jan 02 '15 at 15:51