
tl;dr

On the same numpy array, calculating np.cos takes 3.2 seconds, whereas np.sin runs for 548 seconds (nine minutes) on Linux Mint.

See this repo for full code.


I've got a pulse signal (see image below) which I need to modulate onto an HF carrier, simulating a Laser Doppler Vibrometer. Therefore, the signal and its time basis need to be resampled to match the carrier's higher sampling rate.

pulse signal to be modulated onto HF-carrier
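For reference, the resampling step itself could look roughly like the sketch below, using scipy.signal.resample with the time basis passed via its t argument; this is only a sketch (the sample counts follow from the numbers further down, and an 80-million-point inverse FFT is itself expensive).

import numpy as np
from scipy.signal import resample

pulse = np.load('data/pulse.npy')        # 768 samples at 960 Hz
pulse_time = np.linspace(0, 0.8, len(pulse), endpoint=False)

# upsample to the carrier sample rate (0.8 s * 100 MHz = 80 million samples)
pulse_hf, time_hf = resample(pulse, 80000000, t=pulse_time)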

In the following demodulation process, both the in-phase carrier cos(omega * t) and the phase-shifted carrier sin(omega * t) are needed. Oddly, the time it takes to evaluate these functions depends heavily on how the time vector has been calculated.
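For context, quadrature (I/Q) demodulation in general mixes the measured signal with both carriers and low-pass filters the products; here is a minimal sketch, not necessarily the exact processing chain used here, with the cutoff frequency only a placeholder:

import numpy as np
from scipy.signal import butter, filtfilt

def iq_demodulate(signal, t, carrier_freq, samplerate, cutoff=1e6):
    omega_t = 2 * np.pi * carrier_freq * t
    i = signal * np.cos(omega_t)   # mix with in-phase carrier
    q = signal * np.sin(omega_t)   # mix with phase-shifted carrier
    b, a = butter(4, cutoff / (samplerate / 2))  # 4th-order low-pass
    return filtfilt(b, a, i), filtfilt(b, a, q)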

The time vector t1 is calculated using np.linspace directly, while t2 uses the method implemented in scipy.signal.resample.

import numpy as np

pulse = np.load('data/pulse.npy')  # 768 samples

pulse_samples = len(pulse)
pulse_samplerate = 960  # 960 Hz
pulse_duration = pulse_samples / pulse_samplerate  # here: 0.8 s
pulse_time = np.linspace(0, pulse_duration, pulse_samples,
                         endpoint=False)

carrier_freq = 40e6  # 40 MHz
carrier_samplerate = 100e6  # 100 MHz
carrier_samples = int(pulse_duration * carrier_samplerate)  # 80 million

t1 = np.linspace(0, pulse_duration, carrier_samples)

# method used in scipy.signal.resample
# https://github.com/scipy/scipy/blob/v0.17.0/scipy/signal/signaltools.py#L1754
t2 = np.arange(0, carrier_samples) * (pulse_time[1] - pulse_time[0]) \
        * pulse_samples / float(carrier_samples) + pulse_time[0]

As can be seen in the picture below, the time vectors are not identical. At 80 million samples the difference t1 - t2 reaches 1e-8.

difference between time vectors `t1` and `t2`
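A quick way to quantify that drift (the exact value will depend on the platform):

print(np.abs(t1 - t2).max())  # on the order of 1e-8 for the last samples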

Calculating the in-phase and phase-shifted carrier from t1 takes 3.2 seconds each on my machine.
With t2, however, calculating the phase-shifted carrier takes 540 seconds. Nine minutes. For nearly the same 80 million values.

omega_t1 = 2 * np.pi * carrier_freq * t1
np.cos(omega_t1)  # 3.2 seconds
np.sin(omega_t1)  # 3.3 seconds

omega_t2 = 2 * np.pi * carrier_freq * t2
np.cos(omega_t2)  # 3.2 seconds
np.sin(omega_t2)  # 9 minutes
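A rough way to reproduce these numbers without waiting minutes is to time a slice from the end of each array, where the values are largest; absolute timings will of course vary between machines:

import time

for name, arr in [('omega_t1', omega_t1[-100000:]),
                  ('omega_t2', omega_t2[-100000:])]:
    start = time.perf_counter()
    np.sin(arr)
    print(name, 'sin:', time.perf_counter() - start, 's')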

I can reproduce this bug on both my 32-bit laptop and my 64-bit tower, both running Linux Mint 17. On my flatmate's MacBook, however, the "slow sine" takes as little time as the other three calculations.


I run Linux Mint 17.3 on a 64-bit AMD processor and Linux Mint 17.2 on a 32-bit Intel processor.

Finwood

2 Answers


I don't think numpy has anything to do with this: I think you're tripping across a performance bug in the C math library on your system, one which affects sin near large multiples of pi. (I'm using "bug" in a pretty broad sense here -- for all I know, since the sine of large floats is poorly defined, the "bug" is actually the library behaving correctly to handle corner cases!)
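(For reference, one way to check which C library, and hence which libm, the interpreter is running against on Linux; this is just a diagnostic sketch, not proof of the cause:)

import platform
print(platform.libc_ver())  # e.g. ('glibc', '2.19') on Linux Mint 17.x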

On Linux, I get:

>>> %timeit -n 10000 math.sin(6e7*math.pi)
10000 loops, best of 3: 191 µs per loop
>>> %timeit -n 10000 math.sin(6e7*math.pi+0.12)
10000 loops, best of 3: 428 ns per loop

and other Linux-using types from the Python chatroom report

10000 loops, best of 3: 49.4 µs per loop 
10000 loops, best of 3: 206 ns per loop

and

In [3]: %timeit -n 10000 math.sin(6e7*math.pi)
10000 loops, best of 3: 116 µs per loop

In [4]: %timeit -n 10000 math.sin(6e7*math.pi+0.12)
10000 loops, best of 3: 428 ns per loop

but a Mac user reported

In [3]: timeit -n 10000 math.sin(6e7*math.pi)
10000 loops, best of 3: 300 ns per loop

In [4]: %timeit -n 10000 math.sin(6e7*math.pi+0.12)
10000 loops, best of 3: 361 ns per loop

for no order-of-magnitude difference. As a workaround, you might try taking things mod 2 pi first:

>>> new = np.sin(omega_t2[-1000:] % (2*np.pi))
>>> old = np.sin(omega_t2[-1000:])
>>> abs(new - old).max()
7.83773902468434e-09

which has better performance:

>>> %timeit -n 1000 new = np.sin(omega_t2[-1000:] % (2*np.pi))
1000 loops, best of 3: 63.8 µs per loop
>>> %timeit -n 1000 old = np.sin(omega_t2[-1000:])
1000 loops, best of 3: 6.82 ms per loop

Note that as expected, a similar effect happens for cos, just shifted:

>>> %timeit -n 1000 np.cos(6e7*np.pi + np.pi/2)
1000 loops, best of 3: 37.6 µs per loop
>>> %timeit -n 1000 np.cos(6e7*np.pi + np.pi/2 + 0.12)
1000 loops, best of 3: 2.46 µs per loop
DSM
  • just for completeness: I get ``%timeit -n 1000000 math.sin(6e7*math.pi+0.12)``: ``1000000 loops, best of 3: 461 ns per loop`` and ``%timeit -n 1000000 math.sin(6e7*math.pi)``: ``1000000 loops, best of 3: 425 ns per loop`` with Windows. – MSeifert Mar 05 '16 at 19:17
  • 1
    Might this have to do with [denormal numbers](https://en.wikipedia.org/wiki/Denormal_number)? I remember writing some floating point code that got extremely slow when very small, but nonzero numbers were involved. – cfh Mar 05 '16 at 23:22
  • but this "bug" comes into effect with _big_ numbers, not the close-to-zero ones... – Finwood Mar 06 '16 at 10:24
  • @Finwood Not an explanation, but if the issue is big numbers, can you just take it mod 2pi? – Paul Mar 06 '16 at 22:28
  • @Paul yes, this is what I'm doing to circumvent the issue – Finwood Mar 07 '16 at 10:12

One possible cause of these huge performance differences might be how the math library creates or handles IEEE floating-point underflow (or denormals), which can be produced by differences in some of the tinier mantissa bits during transcendental function approximation. Your t1 and t2 vectors might differ in these smaller mantissa bits, and the behaviour may also depend on the algorithm used to compute the transcendental function in whichever library you linked against, as well as on the IEEE denormal or underflow handling on each particular OS.
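For illustration, here is a small sketch of the kind of slowdown subnormal (denormal) operands can cause on some CPUs; whether this is actually the mechanism behind the slow sin here is speculation:

import timeit
import numpy as np

normal = np.full(1000000, 1e-300)    # normal doubles
subnorm = np.full(1000000, 1e-320)   # subnormal (denormal) doubles

print(timeit.timeit(lambda: normal * 1.0001, number=100))   # fast
print(timeit.timeit(lambda: subnorm * 1.0001, number=100))  # often much slower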

hotpaw2