In a lot of situations I not only need the sine, but also the cosine of the same parameter.

For C, there is the sincos function in the common Unix math library (libm). And at least on i386, this should map to a single assembly instruction, fsincos.

sincos, sincosf, sincosl - calculate sin and cos simultaneously

I guess these benefits exist because there is an obvious overlap in computing sine and cosine: sin(x)^2 + cos(x)^2 = 1. But AFAIK it does not pay off to try to shortcut this as cos = Math.sqrt(1 - sin*sin), as the sqrt function comes at a similar cost.

Is there any way to reap the same benefits in Java? I guess I'm going to pay a price for a double[] then; which maybe makes all the efforts moot because of the added garbage collection.

Or is the Hotspot compiler smart enough to recognize that I need both, and will it compile this to a sincos instruction? Can I test whether it recognizes it, and can I help it recognize this, e.g. by making sure the Math.sin and Math.cos calls are directly successive in my code? This would actually make the most sense from a Java language point of view: having the compiler optimize this to use the fsincos assembly instruction.

Collected from some assembler documentation:

Instruction   8087      287       387       486       Pentium
fsin          -         -         122-771   257-354   16-126    NP
fsincos       -         -         194-809   292-365   17-137    NP
  Additional cycles required if operand > pi/4 (~3.141/4 = ~.785)
sqrt          180-186   180-186   122-129   83-87     70        NP

fsincos should need an extra pop, but that should cost only 1 clock cycle. Assuming the CPU does not optimize this any further, sincos should be almost twice as fast as calling sin twice (the second call computes the cosine, so I figure it will need an extra addition). sqrt could be faster in some situations, but sine can be faster in others.

Update: I've done some experiments in C, but they are inconclusive. Interestingly enough, sincos seems to be even slightly faster than sin (without cos), and the GCC compiler will use fsincos when you compute both sin and cos - so it does what I'd like Hotspot to do (or does Hotspot, too?). I could not yet prevent the compiler from outsmarting me by using fsincos except by not using cos. It will then fall back to a C sin, not fsin.

Has QUIT--Anony-Mousse
  • Have you tried profiling to determine that this is indeed a bottleneck in your application or to determine which of the approaches above perform better? – andand Nov 19 '12 at 19:34
  • @andand: I cannot use `sincos`, without solving the array thing and doing JNI myself. – Has QUIT--Anony-Mousse Nov 19 '12 at 19:36
  • But you can initially use a naïve approach and then profile. If this isn't a bottleneck in your application, you should be focusing your performance enhancing efforts elsewhere. If it is, you can look at either two separate calls (one to sin and another to cos) or use the identity function. If it's still a bottleneck then you could justify more esoteric solutions (such as something in JNI or something else somebody might suggest below). – andand Nov 19 '12 at 19:43
  • I'm doing *tons* of these, and I can't really save them. So I know this is a bottleneck. The Euclidean distance version is roughly one order of magnitude faster than the great circle distance version. – Has QUIT--Anony-Mousse Nov 19 '12 at 19:45
  • This would make a good addition to the standard Java math library. Too bad that returning multiple values is still so clunky. This function in particular is a good argument for implementing anonymous value sequences to be such a return type. – eh9 Nov 20 '12 at 04:29
  • For performance reasons, it may even be desirable to have return `struct`s instead of sequences. This is a prime example of where you may want to always get exactly two doubles back in a "hot" loop, without allocating the memory for a return object. Or for references to primitives. JOGL is another example, it uses a lot of primitive arrays as cheap mutable integers. – Has QUIT--Anony-Mousse Nov 20 '12 at 07:11
  • (+1) for the interesting question. – NPE Nov 20 '12 at 11:19
  • @eh9: Actually, all the runtime would have to do to allow for this kind of multi-value return would be to add a few fields to `Thread` to be used for aggregate function returns, and specify that any code which wishes to use the values returned from a function must save them before calling any other function which isn't explicitly specified to preserve them. – supercat Jun 04 '14 at 04:40
  • @supercat Looking up the current thread isn't exactly free either. Extending the language/VM to allow multiple values to be returned on the stack would IMHO be much nicer. Why is Java restricted to having one primitive or object returned at a time? We're storing all kinds of data on the stack: local variables, return values, ... – Has QUIT--Anony-Mousse Jun 04 '14 at 07:02
  • @Anony-Mousse: I would expect that looking up the current thread should be very cheap in any well-designed multi-tasking system, but looking up information which is not statically attached [e.g. by inclusion within the `Thread` base type] is apt to be significantly more expensive. – supercat Jun 07 '14 at 14:19
  • Calculations for sin or cos "generally" produce both results and discard one. This may explain why 'sincos' is marginally quicker than 'sin' or 'cos' (not convinced though). – simon.watts Apr 22 '15 at 09:07

5 Answers

I have performed some microbenchmarks with caliper: 10000000 iterations over a (precomputed) array of random numbers in the range -4*pi .. 4*pi. I tried my best to get the fastest JNI solution I could come up with; it's a bit hard to predict whether you will actually get fsincos or some emulated sincos. Reported numbers are the best of 10 caliper trials (which in turn consist of 3-10 trials, the average of which is reported). So roughly it's 30-100 runs of the inner loop each.

I've benchmarked several variants:

  • Math.sin only (reference)
  • Math.cos only (reference)
  • Math.sin + Math.cos
  • sincos via JNI
  • Math.sin + cos via Math.sqrt( (1+sin) * (1-sin) ) + sign reconstruction
  • Math.cos + sin via Math.sqrt( (1+cos) * (1-cos) ) + sign reconstruction

Mathematically, (1+sin)*(1-sin) = 1-sin*sin, but the factored form should be more precise when sin is close to 1. The runtime difference is minimal: the plain form 1-sin*sin only saves one addition.

Sign reconstruction via x %= TWOPI; if (x<0) x+=TWOPI; and then checking the quadrant. If you have an idea how to do this with less CPU, I'd be happy to hear.
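The sqrt variant with sign reconstruction described above can be sketched roughly as follows. This is a minimal illustration, not the benchmarked code: the class name SinCosSqrt, the method sinCos and the two-element output array are assumptions made for this example.

```java
final class SinCosSqrt {
    private static final double TWO_PI = 2.0 * Math.PI;
    private static final double HALF_PI = 0.5 * Math.PI;

    /** Writes sin(x) to out[0] and cos(x) to out[1], using one Math.sin call plus a square root. */
    static void sinCos(double x, double[] out) {
        double s = Math.sin(x);
        // |cos(x)| from sin^2 + cos^2 = 1, in the factored form (1+sin)*(1-sin).
        double c = Math.sqrt((1.0 + s) * (1.0 - s));
        // Reduce the angle to [0, 2*pi) so the quadrant can be read off directly.
        double r = x % TWO_PI;
        if (r < 0) {
            r += TWO_PI;
        }
        // Cosine is negative in the second and third quadrants, i.e. for pi/2 < r < 3*pi/2.
        if (r > HALF_PI && r < 3.0 * HALF_PI) {
            c = -c;
        }
        out[0] = s;
        out[1] = c;
    }
}
```

A caller would allocate `double[] out = new double[2]` once and reuse it. Accuracy of the cosine degrades for angles very close to odd multiples of pi/2, where the sine is nearly ±1.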

Numerical loss via sqrt seems to be okay, at least for common angles: the error was on the order of 1e-10 in rough experiments.

Sin         1,30 ==============
Cos         1,29 ==============
Sin, Cos    2,52 ============================
JNI sincos  1,77 ===================
SinSqrt     1,49 ================
CosSqrt     1,51 ================

The sqrt(1-s*s) vs. sqrt((1+s)*(1-s)) choice makes about 0,01 difference. As you can see, the sqrt based approach wins hands down against any of the others (as we can't currently access sincos in pure Java). The JNI sincos is better than computing sin and cos, but the sqrt approach is still faster. cos itself seems to be consistently a tick (0,01) better than sin, but the case distinction to reconstruct the sign has an extra > test. I don't think my results support that either sin+sqrt or cos+sqrt is clearly preferable, but they do save around 40% of the time compared to sin then cos.

If Java were extended to have an intrinsic optimized sincos, then this would likely be even better. IMHO it is a common use case, e.g. in graphics. When used in AWT, Batik etc., numerous applications could benefit from it.

If I'd run this again, I would also add JNI sin and a noop to estimate the cost of JNI. Maybe also benchmark the sqrt trick via JNI. Just to make sure that we actually do want an intrinsic sincos in the long run.

Erich Schubert
  • Are those tests using a sensibly fast sine/cosine, or one which adds extra slow code to achieve an "accuracy improvement" which actually degrades accuracy in common usage scenarios? – supercat Jun 04 '14 at 04:43
  • They are using `java.lang.Math.sin`, `java.lang.Math.cos` and `sincos` as provided by the C library. – Erich Schubert Jun 04 '14 at 07:49

Most sin and cos calculations are calls directly to the hardware. There isn't much of a faster way to calculate it than that. Specifically, in the range +- pi/4, the rates are extremely fast. If you use hardware acceleration in general, and try to limit the values to those specified, you should be fine. Source.

PearsonArtPhoto

You can always profile.

Generally, however, sqrt should come at the same speed as division, as the internal implementations of div and sqrt are very similar.

Sine and cosine, OTOH, are calculated with polynomials of degree up to 10 without any common coefficients, and possibly a difficult modulo 2pi reduction; that reduction is the only common part shared in sincos (when not using CORDIC).

EDIT Revised profiling (with typo corrected) shows timing difference for

sin+cos:  1.580 1.580 1.840 (time for 200M iterations, 3 successive trials)
sincos:   1.080 0.900 0.920
sin+sqrt: 0.870 1.010 0.860
Aki Suihkonen
  • Profiling an *assembler* instruction (`fsincos`) apparently not exposed by Java is hard. – Has QUIT--Anony-Mousse Nov 19 '12 at 19:43
  • @Anony-Mousse: Do all architectures that have a JVM implementation have fsincos? Java can't expose anything that isn't cross platform. – durron597 Nov 19 '12 at 19:44
  • I don't know. But it could still emulate it via sin and cos calls for all others. And for systems without FPU, it may not have sin and cos in assembly either... – Has QUIT--Anony-Mousse Nov 19 '12 at 19:46
  • Well perhaps one can profile it with C/C++/inline assembler to test which approach has the most potential. Of course with sqrt one must choose the proper sign for the sqrt. – Aki Suihkonen Nov 20 '12 at 05:22
  • Note that I'm referring to `sincos(z, &s, &c)`, which needs `#define _GNU_SOURCE` and `double s=0, c=0;`. Interestingly, `sincos` seems to be 10x slower here. From what I googled, this may be because GNU libc optimized `sin` and `cos`, while it uses the FPU `fsincos` assembly for the latter. – Has QUIT--Anony-Mousse Nov 20 '12 at 07:38
  • A) Use different values of `z`. Otherwise, the compiler may cache the results (this happened in my benchmarks, yielding a 10x difference). B) I've noticed that even when I write `sin` and `cos`, my compiler will turn this into an `fsincos` instruction (GCC is damn smart, I don't know if Hotspot can do this, too!). Interestingly, `sincos` was even faster than `sin` in all of my runs. – Has QUIT--Anony-Mousse Nov 20 '12 at 10:21
  • That was the intention. Apparently made a typo z+=0.01, instead of o+=0.01 – Aki Suihkonen Nov 20 '12 at 11:35

Looking at the Hotspot code, I am rather convinced that the Oracle Hotspot VM does not optimize sin(a) + cos(a) into fsincos: See assembler_x86.cpp, line 7482ff.

However, I would suspect that the increased number of machine cycles for using fsin and fcos separately is easily overshadowed by other operations such as running the GC. I would use the standard Java features and profile the application. Only if a profile run indicates that significant time is spent in the sin/cos calls would I venture out to do something about it.

In this case, I would create a JNI wrapper that uses a 2-element jdoubleArray as an out parameter. If you have only one thread that uses the sincos JNI operations, you could use a statically initialized double[2] array in your Java code that would be reused over and over again.
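The Java side of such an out-parameter wrapper might look roughly like the sketch below. The names SinCosBuffer, sincos and sumOfProducts are hypothetical, and the sincos body is a pure-Java stand-in so that the example runs without a native library; a real JNI binding would declare the method as native instead.

```java
final class SinCosBuffer {
    // Reused result buffer: [0] = sine, [1] = cosine. Only safe with a single calling thread.
    private static final double[] RESULT = new double[2];

    // Pure-Java stand-in with the signature a JNI binding might have (assumption).
    // A real wrapper would instead be declared as:
    //     static native void sincos(double x, double[] out);
    static void sincos(double x, double[] out) {
        out[0] = Math.sin(x);
        out[1] = Math.cos(x);
    }

    /** Sums sin(x) * cos(x) over all inputs without allocating an array per call. */
    static double sumOfProducts(double[] angles) {
        double sum = 0.0;
        for (double x : angles) {
            sincos(x, RESULT);
            sum += RESULT[0] * RESULT[1];
        }
        return sum;
    }
}
```

Because the buffer is shared static state, concurrent callers would overwrite each other's results; with several threads, one buffer per thread would be needed.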

nd.
  • It always depends on your use case. Say in a 3D game, you will have most of this work done by the GPU anyway. So sin and cos aren't that frequent. I'm analyzing data sets of lat-lng pairs, and I have to do lots of geodetic distance computations. Which is why I have been considering this JNI approach. Except that it makes porting the application a lot harder. :-( – Has QUIT--Anony-Mousse Nov 20 '12 at 11:47
  • @Anony-Mousse I have no doubt that you are doing lots of sin+cos calculation. However, I (and maybe you) don't know if the additional CPU cycles for separate calculation have a *significant impact* on your application performance - therefore I suggested using a profiler before going the extra mile and doing JNI. – nd. Nov 20 '12 at 12:21
  • The remainder has been extensively profiled and optimized (and become a hell faster over time), in particular wrt avoiding excess object creation. However, I'm even inclined that it may make more sense to propose a patch to Hotspot instead than fighting JNI. Thanks for looking up this line. Unfortunately, the Hotspot code is really hard to navigate. :-( – Has QUIT--Anony-Mousse Nov 21 '12 at 14:40

There is no fsincos available in regular Java. Also, a JNI version may be slower than a double call to java.lang.Math.sin() and cos().

I guess you are concerned about the speed of sin(x)/cos(x). So here is a suggestion for fast trigonometric operations as a replacement for fsincos: a Look Up Table. Below is my original post. I hope it helps you.

=====

I tried to achieve the best possible performance on trigonometric functions (sin and cos), using Look Up Tables (LUT).

What I have found:

  • LUT can be 20-25 times faster than java.lang.Math.sin()/cos(). Possibly as fast as native fsin / fcos. Maybe as fast as fsincos.
  • But java.lang.Math.sin() and cos() are FASTER than any other way to calculate sin/cos, if you use angles between 0 and 45 degrees;
  • But notice that angles lower than 12 degrees have sin(x) almost == x (with x in radians). It is even faster;

  • Some implementations use float array to store sin and another one for cos. This is unnecessary. Just remember that:

cos(x) == sin(x + PI/2)

  • That is, if you have sin(x) table you have cos(x) table for free.
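The point about reusing one table for both functions can be illustrated with a small sketch. This is not the code from the linked sources: the class name SinTable, the whole-degree resolution and the int parameter are assumptions chosen to keep the example short.

```java
final class SinTable {
    private static final int SIZE = 360;                   // one entry per whole degree
    private static final double[] SIN = new double[SIZE];

    static {
        for (int i = 0; i < SIZE; i++) {
            SIN[i] = Math.sin(Math.toRadians(i));
        }
    }

    /** Sine of a whole-degree angle; the angle is wrapped into [0, 360). */
    static double sin(int degrees) {
        return SIN[Math.floorMod(degrees, SIZE)];
    }

    /** Cosine from the same table, using cos(x) = sin(x + 90 degrees). */
    static double cos(int degrees) {
        // Wrap first, then shift, so the +90 cannot overflow the int range.
        return SIN[(Math.floorMod(degrees, SIZE) + 90) % SIZE];
    }
}
```

The cosine lookup costs only one extra index shift, so a second float or double array for cos is not needed.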

I did some tests with sin() for angles in range [0..45], using java.lang.Math.sin(); a naive look up table with 360 positions; an optimized LUT90 with table values for range [0..90], but expanded to work with [0..360]; and a look up table with interpolation. Note that after warm-up, java.lang.Math.sin() is faster than the others:

Size test: 10000000
Angles range: [0.0...45.0]
Time in ms
Trial | Math.sin() | Lut sin() | LUT90.sin() | Lut sin2() [interpolation]
0     | 312,5879   | 25,2280   | 27,7313     | 36,4127
1     | 12,9468    | 19,5467   | 21,9396     | 34,2344
2     | 7,6811     | 16,7897   | 18,9646     | 32,5473
3     | 7,7565     | 16,7022   | 19,2343     | 32,8700
4     | 7,6634     | 16,9498   | 19,6307     | 32,8087

Sources are available on GitHub.

But, if you need high performance in range[-360..360], java.lang.Math lib is slower. A Look up table (LUT) is around 20 times faster. If high precision is required, you can use LUT with interpolation, it is a bit slower but still faster than java.lang.Math. See my sin2() in Math2.java, on link above.
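A look up table with linear interpolation can be sketched as follows. This is not the author's sin2() from Math2.java; the class name InterpolatedSin and the table resolution of 0.1 degree are assumptions for illustration only.

```java
final class InterpolatedSin {
    private static final int STEPS = 3600;                      // table resolution: 0.1 degree
    private static final double[] SIN = new double[STEPS + 1];  // extra entry closes the circle

    static {
        for (int i = 0; i <= STEPS; i++) {
            SIN[i] = Math.sin(2.0 * Math.PI * i / STEPS);
        }
    }

    /** Approximates the sine of an angle in degrees by interpolating between table entries. */
    static double sin(double degrees) {
        // Wrap the angle into [0, 360].
        double d = degrees % 360.0;
        if (d < 0) {
            d += 360.0;
        }
        double pos = d * (STEPS / 360.0);   // fractional table position in [0, STEPS]
        int i = (int) pos;
        if (i >= STEPS) {                   // guard: rounding may land exactly on the last slot
            return SIN[STEPS];
        }
        double frac = pos - i;
        // Linear interpolation between the two neighbouring table entries.
        return SIN[i] + frac * (SIN[i + 1] - SIN[i]);
    }
}
```

With this step size the interpolation error should stay well below 1e-6, at the cost of one extra multiplication and two table reads per call. A finer table trades memory for accuracy.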

The numbers below are for the wide angle range:

Size test: 10000000
Angles range: [-360.0...360.0]
Time in ms
Trial | Math.sin() | Lut sin() | LUT90.sin() | Lut sin2() [interpolation]
0     | 942,7756   | 35,1488   | 47,4198     | 42,9466
1     | 915,3628   | 28,9924   | 37,9051     | 41,5299
2     | 430,3372   | 24,8788   | 34,9149     | 39,3297
3     | 428,3750   | 24,8316   | 34,5718     | 39,5187
Alex Byrth
  • My question is different. I always need *both* `sin(x)` *and* `cos(x)` for the *same* x. There is `fsincos`, but it's not accessible in Java. So I ask about alternatives for the line `double s = Math.sin(x), c = Math.cos(x);` – Has QUIT--Anony-Mousse Feb 15 '16 at 07:17
  • I guess you were concerned about the speed of getting back sin(x)/cos(x), because it is quite easy to write a function that gives back both values. The only drawback is the two expensive calls you make when using java.lang.Math. So I give you a way to have a 20 times faster sin(x) and cos(x), so that the expense no longer matters. And yes, there is no fsincos in pure Java. – Alex Byrth Feb 15 '16 at 09:59