How to hide SHLD delay?

Question

I have a simple bit reader which uses the SHLD instruction (__shiftleft128) to read a bit stream.

This works great. However, I have been doing some profiling and I notice that whatever instruction comes after the SHLD instruction takes a lot of time.

    Assembly                          CPU Time        Instructions Retired
    add r10b, r9b                     19.000ms        92,000,000
    cmp r10b, 0x40                    58.000ms        180,000,000
    jb 0x140016fa6 <Block 24>
    Block 23:
    and r10b, 0x3f                    43.000ms        204,000,000
    mov r15, r11                      30.000ms        52,000,000
    mov qword ptr [rbp+0x20], r11
    add rbx, 0x8                      16.000ms        78,000,000
    mov qword ptr [rbp+0x10], rbx
    mov r11, qword ptr [rbx]          6.000ms         44,000,000
    bswap r11                         2.000ms
    mov qword ptr [rbp+0x28], r11     8.000ms         20,000,000
    Block 24:
    mov rdx, r15                      61.000ms        208,000,000
    movzx ecx, r10b                   1.000ms         6,000,000
    **shld** rdx, r11, cl             24.000ms        58,000,000
    inc edi                           **127.000ms**   470,000,000

As you can see in the table above, the inc instruction right after the shld instruction takes a lot of time (8% of CPU time).
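For readers without the original source, here is a minimal sketch of a reader that would plausibly compile to the listing above. It is reconstructed from the disassembly only, not taken from the actual code: `BitReader`, `load_be64`, and `shift_left_128` are hypothetical names, and `shift_left_128` is a portable stand-in for `__shiftleft128` so the example builds outside MSVC. It assumes the input buffer is padded so that reading up to 16 bytes past the current position is safe.

```cpp
#include <cstdint>

// Portable stand-in (assumption) for MSVC's __shiftleft128(lo, hi, n):
// shifts hi left by n bits, filling the vacated bits from the top of lo.
inline std::uint64_t shift_left_128(std::uint64_t lo, std::uint64_t hi, unsigned n) {
    n &= 63;                 // 64-bit SHLD uses only the low 6 bits of the count
    if (n == 0) return hi;   // a 64-bit shift by 64 would be undefined in C++
    return (hi << n) | (lo >> (64 - n));
}

// Hypothetical big-endian bit reader mirroring the disassembly.
struct BitReader {
    const std::uint8_t* ptr;  // address of the word held in `next` (rbx)
    std::uint64_t cur;        // current 64-bit word (r15)
    std::uint64_t next;       // following 64-bit word (r11)
    unsigned pos;             // bit offset into `cur`, 0..63 (r10b)

    // Byte-wise big-endian load; the listing does this with mov + bswap.
    static std::uint64_t load_be64(const std::uint8_t* p) {
        std::uint64_t v = 0;
        for (int i = 0; i < 8; ++i) v = (v << 8) | p[i];
        return v;
    }

    explicit BitReader(const std::uint8_t* data)
        : ptr(data + 8), cur(load_be64(data)), next(load_be64(data + 8)), pos(0) {}

    // Advance by n bits (n <= 64); refill when a word boundary is crossed.
    void skip(unsigned n) {
        pos += n;             // add r10b, r9b
        if (pos >= 64) {      // cmp r10b, 0x40 / Block 23
            pos &= 63;
            cur = next;
            ptr += 8;
            next = load_be64(ptr);
        }
    }

    // The next 64 bits at the current position (Block 24: shld rdx, r11, cl).
    std::uint64_t peek64() const { return shift_left_128(next, cur, pos); }

    // Read n bits, 1 <= n <= 64.
    std::uint64_t read(unsigned n) {
        std::uint64_t v = peek64() >> (64 - n);
        skip(n);
        return v;
    }
};
```

Note that in this shape the `shld` depends on `next`, which on the refill path has just been loaded and byte-swapped, so the shift cannot start until that load completes.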

I would like to know more about why this is the case and how I can avoid it. Are there any instructions that can run in parallel with an shld at the CPU level?

I remember reading about shld in some AMD optimization manual but I can't find it again.

If that's part of a loop, are you able to measure how many cycles each iteration is taking? I never trust profiling numbers down to the instruction since they can easily be off a bit. — Mysticial, Aug 17 '12 at 19:20
Mysticial: It's part of a loop with a lot more code. It's just a hotspot in the loop. Any instruction after shld has been consistently slow, even in earlier versions of the code that looked different. — ronag, Aug 17 '12 at 19:22
How many cycles is each iteration taking? Can you show the entire loop in C++ as well? The reason why I say that you can't trust those per-instruction numbers is because of the superscalar out-of-order execution. — Mysticial, Aug 17 '12 at 19:29
Assuming you have used `CPU_CLK_UNHALTED`, the counts are usually delayed by one instruction. So it is the `shld` instruction which is slow. This can happen: it generates more µops than most other instructions, and the delay from the preceding loads may add to it. BTW, increase the sample frequency or size. — Gunther Piez, Aug 17 '12 at 21:33

perilbrain · Answer 1 · 2012-08-17T19:58:34.533

It's hard to tell, but it seems like the delay is the result of some exception handling routine.

Behavior

However, the Intel manual specifies a few cases for shld where the result is undefined:

The destination operand can be a register or a memory location; the source operand is a register. The count operand is an unsigned integer that can be stored in an immediate byte or in the CL register. If the count operand is CL, the shift count is the logical AND of CL and a count mask. In non-64-bit modes and default 64-bit mode; only bits 0 through 4 of the count are used. This masks the count to a value between 0 and 31. If a count is greater than the operand size, the result is undefined.

If the count is 1 or greater, the CF flag is filled with the last bit shifted out of the destination operand and the SF, ZF, and PF flags are set according to the value of the result. For a 1-bit shift, the OF flag is set if a sign change occurred; otherwise, it is cleared. For shifts greater than 1 bit, the OF flag is undefined. If a shift occurs, the AF flag is undefined. If the count operand is 0, the flags are not affected. If the count is greater than the operand size, the flags are undefined.

Exceptions for shld:

In protected mode: #GP(0), #SS(0), #PF(fault-code), #AC(0), #UD

UPDATE: Gotcha.
First, the definition:

Instructions Retired — Event select C0H, Umask 00H
This event counts the number of instructions at retirement. For instructions that consist of multiple micro-ops, this event counts the retirement of the last microop of the instruction. An instruction with a REP prefix counts as one instruction (not per iteration). Faults before the retirement of the last micro-op of a multiops instruction are not counted.
This event does not increment under VM-exit conditions. Counters continue counting during hardware interrupts, traps, and inside interrupt handlers.

inc edi **127.000ms** 470,000,000 (instructions retired)
From the above definition it's quite clear that either this instruction breaks into too many micro-ops or some interrupt handler is running at the same time.

So I think the only way is to report this to the processor vendor, because no such behavior is documented anywhere. I searched everywhere :( — perilbrain, Aug 17 '12 at 19:30
That undefinedness only applies to specifying a count which (modulo 32) is bigger than 16 for a 16-bit `shld`. And there wouldn't be an exception, just an undefined result. — harold, Aug 17 '12 at 19:36
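The point in the last comment can be illustrated with a small model of the 16-bit form, written from the manual text quoted in the answer. This is only a sketch: `shld16` and `Shld16Result` are hypothetical names, the model ignores flags, and an undefined result is reported through a `defined` field rather than reproduced, since real hardware returns some value without raising an exception.

```cpp
#include <cstdint>

// Hypothetical result type: `defined` is false where the manual leaves
// the destination value undefined.
struct Shld16Result {
    bool defined;
    std::uint16_t value;
};

// Models `shld r16, r16, cl` (destination value only, flags ignored).
inline Shld16Result shld16(std::uint16_t dst, std::uint16_t src, unsigned cl) {
    unsigned count = cl & 31;              // only bits 0..4 of CL are used
    if (count == 0) return {true, dst};    // no shift takes place
    if (count > 16) return {false, 0};     // undefined result, but no fault

    // Treat dst:src as one 32-bit value, shift left, keep the upper half.
    std::uint32_t wide = (static_cast<std::uint32_t>(dst) << 16) | src;
    return {true, static_cast<std::uint16_t>((wide << count) >> 16)};
}
```

For the 64-bit `shld rdx, r11, cl` in the question, the count is masked to 6 bits instead, so it can never exceed the operand size and the undefined case does not arise.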
