
Haswell now has 2 branch units, as shown here: http://arstechnica.com/gadgets/2013/05/a-look-at-haswell/2/

[Haswell execution-unit diagram from the linked Ars Technica article]

Does this mean that Haswell is a dual-path execution CPU?

In the sense of this paper: http://ditec.um.es/~jlaragon/papers/aragon_ICS02.pdf

And does it mean that Haswell can execute a 2nd branch only on the Integer ALU & Shift unit on port 6, and not on the ALUs on other ports?

– Alex

  • I don't think this question is so unclear that it should be closed. It's full of misconceptions (like some of this user's previous questions), but not to the point where it's unanswerable. I did have to kind of guess at what the extra question in the last paragraph was supposed to be. It would be a better question if it included a summary of the paper like I did in my answer, though, since the question would become unanswerable and meaningless if that link broke. – Peter Cordes Jul 14 '16 at 05:20

2 Answers


No, Haswell still only speculates along the predicted side of a branch.

The branch unit on port0 can only execute predicted not-taken branches, as you can see from Agner Fog's instruction tables. This speeds up execution of a big chain of compare-and-branch where most of them are not-taken. This is not unusual in compiler-generated code.
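As an illustration (my own sketch, not from the question; the macro-fusion and port behavior is per Agner Fog's tables), this is the kind of compare-and-branch chain that benefits:

```asm
; Sketch: a run of error checks that almost always fall through.
; Each test/jcc or cmp/jcc pair macro-fuses into a single uop, and a
; predicted not-taken branch can execute on port 0 or port 6, so two
; of these checks can resolve per cycle.
    test  rdi, rdi
    jz    .null_ptr        ; predicted not-taken
    cmp   rsi, 4096        ; hypothetical size limit
    ja    .too_long        ; predicted not-taken
    test  edx, edx
    js    .negative        ; predicted not-taken
    ; ... fast path continues
```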

See David Kanter's Haswell writeup, specifically the page about execution units. If Haswell had introduced the feature described in that paper you linked, Kanter's writeup would have mentioned it, and so would Intel's optimization manual, and Agner Fog's microarch pdf. (See the tag wiki for links to that and more).

One big advantage to the integer/branch unit on port6 is that it's not shared with any of the vector execution ports. So a loop can have 3 vector ALU uops and a branch, and still run at one iteration per cycle. David Kanter's writeup says the same thing.
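For example, a loop like the following can sustain one iteration per clock on Haswell. This is my own sketch; the per-port assignments are from Agner Fog's instruction tables:

```asm
; Sketch: 4 fused-domain uops per iteration, each going to a different
; port, so the loop can run at 1 iteration per clock on Haswell.
.loop:
    vmulps  ymm0, ymm0, ymm4   ; FP multiply: port 0 (or 1)
    vaddps  ymm1, ymm1, ymm5   ; FP add: port 1
    vpand   ymm2, ymm2, ymm6   ; vector logical: port 5
    dec     rcx
    jnz     .loop              ; dec/jnz macro-fuse and run on port 6
```

On Sandy Bridge/Ivy Bridge, where the only branch unit shares port 5 with the vector ALUs, the loop branch would compete with the vector work.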


And does it mean that Haswell can execute a 2nd branch only on the Integer ALU & Shift unit on port 6, and not on the ALUs on other ports?

If the idea from that paper was implemented, it would affect the whole pipeline, not just the port that executes branches!

From the paper:

Dual Path Instruction Processing (DPIP) is proposed as a simple mechanism that fetches, decodes, and renames, but does not execute, instructions from the alternative path for low confidence predicted branches at the same time as the predicted path is being executed.

So in fact there would be no execution unit involved for the alternate path. This should be obvious...

– Peter Cordes
  • Thank you! I.e., is the branch unit on port 6 only there so that "a loop can have 3 vector ALU uops and a branch, and still run at one iteration per cycle"? Or does it also allow the 2 hyperthreading threads to follow different branches? – Alex Jul 13 '16 at 22:54
  • @Alex: Hyperthreading has nothing to do with this. The out-of-order core can only retire one predicted-taken branch per clock. Branch prediction happens much earlier in the pipeline, though. Also, the 3 vector ALU uops are just one example use case. port6 can run simple ALU ops like `add`, so their throughput is 4 per clock on Haswell vs. 3 per clock on IvB/SnB. – Peter Cordes Jul 13 '16 at 22:57

You don't need to execute both paths: given that there's a branch roughly every 5 instructions on average, doing that would quickly leave you with an exponential number of paths. Even if you only diverge like that on hard-to-predict branches, you could still end up with a significant number of parallel paths.
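To put rough numbers on it (my own back-of-the-envelope figures, not from the answer): Haswell's reorder buffer holds 192 uops, so at one branch per ~5 instructions there can be on the order of 38 unresolved branches in flight at once. Forking on every one of them would mean up to 2^38 ≈ 2.7 × 10^11 simultaneous paths; even forking on only, say, 8 low-confidence branches would already give 2^8 = 256 paths.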

The reason for adding a second branch unit is much simpler - in an out-of-order machine, even computing a single predicted "main" path of execution, you would still end up with a large number of in-flight branches. Note that prediction is done at the beginning of the pipeline, so it's decoupled from the actual execution and resolution of each branch. In practice, the front-end will feed the machine with branches, and the OOO machine needs to resolve dependencies and execute them as fast as possible (since you want to resolve the predictions as early as you can, and recover if you were wrong). I guess the designers discovered that additional execution bandwidth is needed, since there could be cases where multiple branches (that may not even be consecutive in program order) get their sources ready simultaneously and suddenly need to execute all at once. Hence the comment about "2nd EU for high branch code".
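A dispatch on a just-computed value is a concrete example of such a burst (my own sketch):

```asm
; Sketch: once the load of x completes, all of these macro-fused
; compare-and-branch uops become ready in the same cycle. With branch
; units on both port 0 and port 6, two of them can execute per clock
; instead of one.
    mov   eax, [rdi]       ; long-latency load producing x
    cmp   eax, 1
    je    .case1
    cmp   eax, 2
    je    .case2
    cmp   eax, 3
    je    .case3
    ; ... more cases / default
```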

Aside from branches conflicting with each other, you can see that execution port 0 is also burdened with many other types of operations, so a branch could be ready to execute but stalled behind non-branch operations. Hence the other comment about port 0 conflicts (in theory, they could have just moved the branch execution unit to another port, but that would add other conflicts, and it wouldn't resolve the branch-vs-branch conflicts).

– Leeor
  • Pre-Haswell CPUs have the branch unit on port5, so for example FP-heavy code can saturate ports 0/1 with FP mul and add uops and have the loop overhead run (hopefully mostly) on p5. – Peter Cordes Jul 16 '16 at 08:52
  • Interesting point about discovering mispredicts sooner. I was mostly thinking of branch throughput for branch-heavy code, not latency. I'm not sure if the frontend can handle more than one predicted-taken branch per clock. The uop cache caches traces, so it's maybe possible. But if not, that explains why port0 only handles predicted-not-taken branches: the frontend can only sustain one taken branch per clock anyway. Or maybe the other reason is to make sure predicted-taken loop branches never steal p0 and reduce the vector ALU throughput. – Peter Cordes Jul 16 '16 at 08:59
  • @PeterCordes, what do you mean by "handle", predicting or recovering? I'm not sure the front-end can or should recover more than one, but not all executed branches result in a misprediction anyway. If it's about prediction - the front-end and back-end may have decoupled bandwidth - you can predict 1 branch per cycle and still get a local congestion at the backend (for example - a `switch(x)` will have any number of branches (cases) ready to execute once `x` is generated), regardless of how long it took the front-end to feed them into the OOO machine. – Leeor Jul 16 '16 at 09:32
  • I meant can the front-end issue a group of up to 4 uops with two predicted-taken branches in the same cycle. That would mean two extra changes in RIP in the same cycle. IIRC, a predicted-taken branch ends an issue group. e.g. a 6 uop loop runs at best one iteration per 2 clocks, not one per 1.5. (Because it issues ABCD EF / ABCD EF. Not ABCD EFAB / CDEF). And like you mentioned, I also guessed that the branch predictor can probably only generate one prediction per cycle. – Peter Cordes Jul 16 '16 at 10:38
  • I'm not sure exactly when branch prediction happens. If predicted-taken and predicted-not-taken uops can sit in the loopback buffer without needing to be re-predicted, it should be possible to sustain issuing a 4 uop loop with a not-taken branch in the body and a taken branch at the end. If not, then the extra execution capacity for not-taken branches is probably mostly useful for cases like you mentioned, where `x` isn't ready until after several branches have issued. This exact case alone is maybe common enough to justify the extra branch unit. – Peter Cordes Jul 16 '16 at 10:52
  • Hrm, prediction must happen well before issue, because there's a queue between the uop cache and the issue stage. (The same buffer that's used for loops). So probably two predicted-taken branches could issue in the same cycle. I was only trying to come up with reasons why the branch unit on port0 only handles predicted-not-taken branches, and maybe this is the wrong approach. Avoiding loop branches stealing port0 in vector loops is probably significant, and efficient branch-heavy code will have a lot of not-taken branches. – Peter Cordes Jul 16 '16 at 10:59
  • These are all good questions, wish I knew, but it's probably very design specific. It does make sense that uop caching (or loop stream detection for that matter) will have different restrictions because you don't need to lookup the targets. My point was that regardless of the peak BW in which the front-end produces branches - executing 2 branches per cycle can prove useful. – Leeor Jul 16 '16 at 12:49
  • Yes, I agree with that. But it doesn't explain why they'd make the branch unit on port0 only handle predicted-not-taken uops. Sustained branch throughput might be part of it, or it might not. Like you say, burst branch throughput is useful on its own. I wasn't trying to disagree with anything you said, just to see what I could conclude based on the known facts. – Peter Cordes Jul 16 '16 at 13:20
  • @PeterCordes I was also surprised by your claim that port0 only executes predicted not-taken branches. But I did look at the Agner Fog source you referenced so I see where it comes from. I guess that executing only predicted not-taken branches on port0 could avoid having two execution units contending for BTB access (since BTB is only needed to check the target for taken branches). I suppose it's also possible that Agner Fog could be wrong, I didn't see anything in the Intel documentation saying that the second branch EU only executed predicted not-taken branches. – Gabriel Southern Jul 16 '16 at 16:41
  • @GabrielSouthern: I'm pretty sure that the BTB result is needed *way* before the branch executes, so neither execution unit accesses the BTB at all. I think avoiding loop branches stealing port0 is a sufficient reason. As I understand it, uops are allocated to ports at issue time, not dispatch time, and I'm not clear on how that stage decides which port to choose. – Peter Cordes Jul 16 '16 at 18:58
  • @PeterCordes: yes you are correct the BTB is needed when the prediction is made, which will be long before the branch is executed. I was thinking there might be a check of the BTB after executing the branch to be sure the BTB had the correct prediction information. But on further consideration that doesn't really seem like a good explanation. Your suggestion makes sense, but without more information from Intel it's probably difficult to know for sure what the reasons are. – Gabriel Southern Jul 16 '16 at 20:20