Questions tagged [sse4]

Intel's Streaming SIMD Extensions 4 instruction set for x86 processors.

Intel's Streaming SIMD Extensions 4 instruction set for Intel Core architecture x86 processors and AMD's K10 x86 processors. Intel's SSE4 comprises 54 new instructions: 47 in SSE4.1 and 7 in SSE4.2.

These instructions encompass Intel's SSE4.1 and SSE4.2 instruction sets as well as AMD's SSE4a instruction set. More detailed information on the new instructions can be found in both Intel's and AMD's developer manuals or, more conveniently, on Wikipedia.

55 questions
2
votes
1 answer

Does AVX or AVX2 support 256 bit string instructions and mullo for unsigned short?

I researched the string instructions supported in the AVX and AVX2 ISAs, but I cannot find any 256-bit string comparison instruction like those in SSE4.2. If there are any string comparison instructions that I missed, where can I find them? Otherwise, why AVX/AVX2…
ADMS
  • 117
  • 3
  • 18
2
votes
1 answer

SSE: How to extract the sign bit for each packed byte, into a packed register?

Given packed bytes in xmm0, what is an efficient way to extract the sign (i.e. highest-order) bit of each byte into xmm1? In other words I want to compute the logical AND with 0x80 for each packed byte. For example: xmm0: 0xff 0xef 0x80 0x7f 0x01…
jacobsa
  • 5,719
  • 1
  • 28
  • 60
2
votes
1 answer

Getting min short value in a __m128i vector with SSE?

This question seems similar to Getting max value in a __m128i vector with SSE? but with shorts and minimum instead of integer + maximum. This is what I came up with: typedef short int weight; weight horizontal_min_Vec4i(__m128i x) { __m128i…
Alexandros
  • 2,160
  • 4
  • 27
  • 52
2
votes
1 answer

pcmpestri instruction to write similar strpos function?

How can the pcmpestri instruction be used to write a function similar to strpos in C++? I can use the g++ compiler. pcmpestri is a new instruction found in SSE4.2.
I'll-Be-Back
  • 10,530
  • 37
  • 110
  • 213
1
vote
1 answer

Inline-Assembler-Code in C, copy values from Array to xmm

I have two arrays and I want to compute their dot product. How do I get the values of vek and vec into xmm0 and xmm1? And how do I get the value in xmm1 (??) so that I can use it for "printf"? #include main(){ float vek[4] = {4.0, 3.0,…
degude
  • 365
  • 2
  • 4
  • 10
1
vote
0 answers

The correct way to search for a substring in a string

The main part of my question is: how to deal with cases where a string loaded into an __m128i contains only part of a substring? The requirement: to search for escaped sequences or the '"' (double quote, not escaped) at the same time. Examples: (there…
niXman
  • 71
  • 7
1
vote
1 answer

Why does the pseudocode of _mm_insert_ps calculate %8?

In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps defines the following: FOR j := 0 to 3 i := j*32 IF imm8[j%8] dst[i+31:i] := 0 ELSE dst[i+31:i] := tmp2[i+31:i] FI ENDFOR. The…
Brotcrunsher
  • 1,964
  • 10
  • 32
1
vote
1 answer

Optimizing find_first_not_of with SSE4.2 or earlier

I am writing a textual packet analyzer for a protocol, and in optimizing it I found that a major bottleneck is the find_first_not_of call. In essence, I need to determine whether a packet is valid, i.e. whether it contains only valid characters, faster than the default…
senseiwa
  • 2,369
  • 3
  • 24
  • 47
1
vote
1 answer

Move data from memory (could be of any length) to XMM

I know a little assembly (NASM), and I wanted to perform a string operation (checking whether a substring is present) using SSE4.2, so I learned how PCMPESTRI and PCMPISTRM work. I am stuck in the middle, i.e. on data transfer from memory to an xmm register. Basically, I…
Sanket
  • 65
  • 1
  • 5
1
vote
1 answer

builtin pcmpistri not working in gcc

I'm trying to write a strcmp version that takes advantage of SSE4.2 new instructions leveraging GCC intrinsics. This is the code I have so far: #include #include int main(int argc, char const *argv[]) { int n; const…
Samuele Pilleri
  • 734
  • 1
  • 7
  • 17
1
vote
1 answer

Installing TensorFlow from Sources, on windows 10

I have already installed tensorflow-gpu, and it is working fine. I now want to install tensorflow-gpu from source to take advantage of AVX and SSE4.2-1.0 instruction set, given my system configuration below; CPU : Dual Intel Xeon E5 2670, Sandy…
Yog Gaj
  • 21
  • 2
1
vote
3 answers

How to load 96 bits from memory into an XMM register?

Say I have a pointer to memory in rsi, and I would like to load the 12-byte value pointed to into the low 96 bits of xmm0. I don't care what happens to the high 32 bits. What's an efficient way to do this? (Side question: the best I've come up with…
jacobsa
  • 5,719
  • 1
  • 28
  • 60
1
vote
1 answer

_mm_testc_ps and _mm_testc_pd vs _mm_testc_si128

As you know, the first two are AVX-specific intrinsics and the third is an SSE4.1 intrinsic. Both sets of intrinsics can be used to check for equality of 2 floating-point vectors. My specific use case is: _mm_cmpeq_ps or _mm_cmpeq_pd, followed…
user1095108
  • 14,119
  • 9
  • 58
  • 116
1
vote
0 answers

Segmentation fault with -xSSE4.1 flag

I am getting a segmentation fault while running my executable, which was built with the -xSSE4.1 compilation flag. I am running it on a machine that supports SSE4.1, SSE4.2 and AVX. The intrinsic that causes the segmentation fault: m_best_cost_0 …
MediocreMyna
  • 269
  • 1
  • 5
  • 12
1
vote
1 answer

SSE 4.2: alternative to _mm_cmpistri

I wrote a program that runs _mm_cmpistri to find the next \n (newline) character. While this works great on my computer, it fails on a server due to missing SSE 4.2 support. Is there a good alternative using SSE instructions <= SSE 4.1?
moo
  • 486
  • 8
  • 22