Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

A Fused Multiply Add (also known as Multiply-Accumulate) performs a multiplication followed by an addition or subtraction as a single operation, with only one rounding at the end.

For example:

x = a * b + c

would normally be computed with two roundings without Fused Multiply Add: one after a * b and one after a * b + c.

Fused Multiply Add combines the two operations into a single operation with a single rounding, thereby increasing the accuracy of the computed result.

Supported Architectures include:

  • PowerPC
  • Intel x86 (via the FMA3 instruction set)
  • AMD x86 (via FMA4, and FMA3 on Piledriver and later)
82 questions
1
vote
0 answers

Latency and number of FMA units

I'm trying to implement the convolution algorithm described in this paper. The authors state that the number of independent elements processed by FMA instructions is lower bounded by the latency of FMA instructions and it is upper bounded by the…
1
vote
2 answers

Difference between FMA and naive a*b+c?

In the BSD Library Functions Manual of FMA(3), it says "These functions compute x * y + z." So what's the difference between FMA and naive code which does x * y + z? And why does FMA have better performance in most cases?
Patroclus
  • 1,163
  • 13
  • 31
1
vote
1 answer

How to solve "illegal instruction" for vfmadd213ps?

I have tried AVX intrinsics. But it caused "Unhandled exception at 0x00E01555 in test.exe: 0xC000001D: Illegal Instruction." I used Visual Studio 2015. And the exception is raised at the "vfmadd213ps ymm2,ymm1,ymm0" instruction. I have tried set…
hbs
  • 25
  • 4
1
vote
1 answer

Is there a way to use OpenCL C mad function in Vulkan SPIR-V?

As we know, there are at least two ways to calculate a * b + c: ret := a*b; ret := ret + c; ret := fma(a, b, c); But in OpenCL C, there's a third function called "mad" that trades precision for performance. In the LunarG sdk, the default SPIR-V…
DannyNiu
  • 1,313
  • 8
  • 27
1
vote
0 answers

tensorflow-1.12.0rc1-cp27-cp27mu-linux_x86_64.whl is not a supported wheel on this platform

I installed tensor flow on Intel NUC with pip3 pip3 install --upgrade tensor flow But got below error 2018-10-25 20:14:31.685641: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was…
Neeraj Sharma
  • 174
  • 1
  • 3
  • 14
1
vote
1 answer

How to avoid the error of AVX2 when the matrix dimension isn't multiples of 4?

I made matrix-vector multiplication program using AVX2, FMA in C. I compiled using GCC ver7 with -mfma, -mavx. However, I got the error "incorrect checksum for freed object - object was probably modified after being freed." I think the error would…
Mic
  • 31
  • 1
1
vote
1 answer

What is the instruction number per cycle in fma with minus?

If I use fma(a, b, c) in CUDA, it means that the formula a*b+c is calculated in a single ternary operation. But if I want to calculate -a*b+c, does invoking fma(-a, b, c) take one more multiply operation?
Jannus YU
  • 89
  • 6
1
vote
1 answer

_mm_fmadd_pd Program received signal SIGILL, Illegal instruction

I am getting a weird error for the following code: #include #include #include inline static double myfma(double x,double y, double z) { double r; // result …
1
vote
0 answers

Are there any FMA gains for negative accumulator?

Working with c++ AMP, I'm trying to optimize my math functions. Ran into a bit of a conundrum with cross product: float_3 CrossProduct(float_3 v1, float_3 v2) restrict(amp) { float a = mad(v1.y, v2.z, -v1.z * v2.y); float b = mad(v1.z, v2.x,…
user81993
  • 6,167
  • 6
  • 32
  • 64
0
votes
0 answers

GCC 12 (minGW 64): how to enable fused multiply add code generation

I apologize in advance in case the answer to my question is obvious but trust me, I have been googling the whole day and searched here as well without finding anything relevant to it. I am using GCC 12 (minGW x64) on my x64 Windows i7 setup. I don't…
elena
  • 233
  • 1
  • 7
0
votes
0 answers

How is fast fma() implemented

FMA is a fused multiply-add instruction. The fmaf(float x, float y, float z) function in glibc calls the vfmadd213ss instruction. The fma(double x, double y, double z) variant uses the double type. The linked page gives a software implementation…
xiaohuihui
  • 45
  • 4
0
votes
0 answers

vfmadd231ps Floating Point Exception c0000090

What is the problem with the vfmadd231ps instruction which throws a floating point Exception? 00007ff9`88f05108 62b26558b81482 vfmadd231ps zmm2,zmm3,dword bcst [rdx+r8*4] ds:00000220`7de9dc00=00000000 0:112> .exr -1 …
Alois Kraus
  • 13,229
  • 1
  • 38
  • 64
0
votes
0 answers

Is there any better implementation for integer 'mul and add' with AVX?

I've just learned how to optimize GEMM with x86 vector registers, and we were given matrices whose entries are 32-bit int, and just neglect the overflow for simplification. There's a _mm256_fmadd_pd for double floating-point numbers to update the…
0
votes
1 answer

How to find magic multipliers for divisions by constant on a GPU?

I was looking at implementing the following computation, where divisor is nonzero and not a power of two unsigned multiplier(unsigned divisor) { unsigned shift = 31 - clz(divisor); uint64_t t = 1ull << (32 + shift); return t /…
amonakov
  • 2,324
  • 11
  • 23
0
votes
1 answer

incompatible types when assigning to type ‘__m256d’ from type ‘int’

I'm working on a project to optimize Matrix Multiplication and I'm trying to use intrinsics. Here's a bit of the code I'm using : #include /* Vector tiling and loop unrolling */ static void do_block(int lda, int M, int N, int K,…
Mehdi
  • 77
  • 11