On Intel Xeon Phi, there are 32 vector registers per core, each 512 bits wide. A 512-bit register holds 16 single-precision floating-point values, so one vector instruction can perform 16 single-precision operations per cycle. In addition, two instructions can be issued in one cycle (one in the V-pipe and one in the U-pipe).
I want to know how many scalar multiplications can be done in one clock cycle, in addition to the vector multiplications performed on the vector registers.
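
To make the arithmetic behind the figure of 16 explicit, here is a minimal Python sketch. It only divides the register width by the element width; the constant and function names are hypothetical and do not come from any Intel tool or API, and it says nothing about how many scalar operations can be issued alongside a vector instruction.

```python
# Illustrative arithmetic only; all names are made up for this sketch.
VECTOR_REGISTER_BITS = 512   # width of one vector register, as stated above
SINGLE_PRECISION_BITS = 32   # width of one IEEE 754 single-precision float


def lanes_per_register(register_bits, element_bits):
    """Return how many elements of the given width fit in one register."""
    return register_bits // element_bits


# Number of single-precision values one 512-bit register holds, which is
# also the number of single-precision operations one vector instruction
# performs.
sp_lanes = lanes_per_register(VECTOR_REGISTER_BITS, SINGLE_PRECISION_BITS)
print(sp_lanes)  # 16
```

So the 16 operations per cycle come from one vector instruction working on all 16 lanes at once, not from the register itself doing any work.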