I browsed through the Tilera Instruction Set and noticed it has only add, subtract, multiply, divide, and shifts. There is no mention of roots, powers, logs, etc.

I also noticed that SSE (in all flavors) lacks most of these as well: it has square root, but no powers, logs, or trig functions.

Both the Tilera and SSE are designed for math-based processing such as video encoding, so this has made me curious.

How would one perform one of these operations in such cases (Tilera & SSE [excluding regular scalar ops])?

IamIC
  • You implement them in software. It takes a lot of silicon to implement a natural log, for example, in hardware... – Mysticial Apr 20 '12 at 20:04
  • @Mysticial... funny you should answer. I was on your website an hour ago. Ok, so for Intel, you would have to use the scalar FP instructions, and for Tilera, you'd have some performance-awful code? As in 2 orders of magnitude slower than in silicon? – IamIC Apr 20 '12 at 20:05
  • lol XD... small world. Anyways, they can be vectorized if there isn't too much data-dependent branching. Just take whatever existing software implementation, replace everything with SIMD and now you can do two operations at the same time. I believe the Intel Compiler has (vertically vectorized) implementations of all the math functions. – Mysticial Apr 20 '12 at 20:09
  • VERY small world! :) So a simple formula: x = y ^ 1.5 can be vectorized using, I assume, some form of multiplication? – IamIC Apr 20 '12 at 20:12
  • Yeah. `x = y ^ 1.5` or any other function would eventually break down to adds/subtracts/multiplies, etc... So they are usually vectorizable if there isn't too much data-dependent branching. – Mysticial Apr 20 '12 at 20:13
  • Ok, thanks. You answered my question. But I can't click accept ;) – IamIC Apr 20 '12 at 20:14
  • I'll make it an answer then. :) – Mysticial Apr 20 '12 at 20:15
  • I would love to see how sin() and log() are broken into +-*/! – IamIC Apr 20 '12 at 20:18
  • Those can be done using their Taylor Series. It might take an extra argument reduction step, but the expensive work can be done using vectorizable Taylor Series evaluation. – Mysticial Apr 20 '12 at 20:23
  • You're the exact type of programmer that is hard to find. – IamIC Apr 20 '12 at 20:25
  • For sin/cos there's an alternative described here: http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/ – harold Apr 21 '12 at 10:34
  • On a related topic, I was trying to figure out how Intel's Knights Corner could produce a sustainable 1 TFLOP DP. It doesn't add up unless the number of cores is way more than hinted. You haven't by any chance crunched the figures, @Mysticial? – IamIC Apr 21 '12 at 17:24
  • I haven't looked at Knights Corner (or even Knights Ferry) yet so I wouldn't know. But there's probably some underlying SIMD that is producing a large factor of the performance. – Mysticial Apr 21 '12 at 17:59
  • Well, the SIMD vectors are 512-bit. The CPU is about 1.2 GHz per what I've read, and has an estimated 64 cores. That is simply not enough for 1 TFLOP DP (and it's claimed to be sustainable, so one would assume there is also I/O, not some synthetic benchmark). I estimate the chip could get that only for SP. – IamIC Apr 21 '12 at 20:33

1 Answer

To keep the hardware simple, vector instruction sets usually implement only the basic operations that are used most often.

More advanced functions are used less often and take up a lot of die area. Trig functions, logs, powers, etc. are hard and expensive to implement in hardware.

In any case, nearly all special functions break down into basic operations (add/subtract/multiply/divide), so as long as the hardware provides those, you can implement the rest in software.
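As a rough illustration of that breakdown, here is a minimal scalar sketch of sine built from adds, subtracts, and multiplies, plus one round-to-integer for argument reduction. The function name `sin_poly` and the choice of a truncated Taylor series (through the r^15 term) are assumptions made for this example; real math libraries use more carefully tuned polynomials and more careful argument reduction, and this version loses accuracy for very large |x|.

```c
/* Sketch only: sin(x) from basic arithmetic.
   sin_poly is a hypothetical name, not a library function. */
double sin_poly(double x)
{
    const double PI = 3.14159265358979323846;

    /* Argument reduction: write x = k*pi + r with r in [-pi/2, pi/2]. */
    double kf = x * (1.0 / PI);
    long k = (long)(kf + (kf >= 0.0 ? 0.5 : -0.5)); /* round to nearest */
    double r = x - (double)k * PI;
    double r2 = r * r;

    /* Taylor series of sin(r)/r in Horner form, terms up to r^14. */
    double p = -1.0 / 1307674368000.0;   /* -1/15! */
    p = p * r2 + 1.0 / 6227020800.0;     /* +1/13! */
    p = p * r2 - 1.0 / 39916800.0;       /* -1/11! */
    p = p * r2 + 1.0 / 362880.0;         /* +1/9!  */
    p = p * r2 - 1.0 / 5040.0;           /* -1/7!  */
    p = p * r2 + 1.0 / 120.0;            /* +1/5!  */
    p = p * r2 - 1.0 / 6.0;              /* -1/3!  */
    p = p * r2 + 1.0;
    p = p * r;                           /* now p ~ sin(r) */

    /* sin(k*pi + r) = (-1)^k * sin(r) */
    return (k & 1) ? -p : p;
}
```

The only branch left is the final sign flip, which is the kind of data-dependent step that has to be turned into a bitwise operation when vectorizing.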

Vectorizing a special function is usually possible if there isn't too much data-dependent branching: you can simply take the scalar implementation and replace each operation with its SIMD version.
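To make that concrete, here is a hedged sketch of a four-wide single-precision sine using SSE/SSE2 intrinsics. The name `sin4_ps` is hypothetical, the polynomial is a short truncated Taylor series (roughly 1e-5 absolute accuracy for moderate inputs, not production quality), and the rounding step assumes the default round-to-nearest mode. The one data-dependent decision, the sign flip for odd multiples of pi, is done with a shift and an XOR instead of a branch.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE float intrinsics) */

/* Sketch only: computes sin() of four floats at once.
   sin4_ps is a hypothetical name, not part of any library. */
__m128 sin4_ps(__m128 x)
{
    const __m128 inv_pi = _mm_set1_ps(0.318309886f);
    const __m128 pi     = _mm_set1_ps(3.14159265f);

    /* Argument reduction: k = round(x / pi), r = x - k*pi. */
    __m128i k  = _mm_cvtps_epi32(_mm_mul_ps(x, inv_pi));
    __m128  kf = _mm_cvtepi32_ps(k);
    __m128  r  = _mm_sub_ps(x, _mm_mul_ps(kf, pi));
    __m128  r2 = _mm_mul_ps(r, r);

    /* Taylor series of sin(r)/r in Horner form, terms up to r^8. */
    __m128 p = _mm_set1_ps(1.0f / 362880.0f);                      /* +1/9! */
    p = _mm_sub_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 5040.0f)); /* -1/7! */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 120.0f));  /* +1/5! */
    p = _mm_sub_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 6.0f));    /* -1/3! */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f));
    p = _mm_mul_ps(p, r);                                           /* ~ sin(r) */

    /* Branch-free (-1)^k: move the low bit of k into the sign bit, then XOR. */
    __m128i sign = _mm_slli_epi32(k, 31);
    return _mm_xor_ps(p, _mm_castsi128_ps(sign));
}
```

A caller would typically load inputs with `_mm_loadu_ps`, call the function, and store the results with `_mm_storeu_ps`, looping over the array four elements at a time.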

Mysticial