No, Haswell still only speculates along the predicted side of a branch.
The branch unit on port0 can only execute predicted not-taken branches, as you can see from Agner Fog's instruction tables. This speeds up execution of a big chain of compare-and-branch where most of them are not-taken. This is not unusual in compiler-generated code.
See David Kanter's Haswell writeup, specifically the page about execution units. If Haswell had introduced the feature described in that paper you linked, Kanter's writeup would have mentioned it, and so would Intel's optimization manual, and Agner Fog's microarch pdf. (See the x86 tag wiki for links to that and more).
One big advantage to the integer/branch unit on port6 is that it's not shared with any of the vector execution ports. So a loop can have 3 vector ALU uops and a branch, and still run at one iteration per cycle. David Kanter's writeup says the same thing.
And does it mean that Haswell can execute 2-nd branch only on Integer ALU & Shift (Port 6) and not on any other ALU on other Ports?
If the idea from that paper was implemented, it would affect the whole pipeline, not just the port that executes branches!
From the paper:
Dual Path Instruction
Processing (DPIP) is proposed as a simple mechanism that
fetches, decodes, and renames, but does not execute, instructions
from the alternative path for low confidence predicted branches at
the same time as the predicted path is being executed.
So in fact there would be no execution unit involved for the alternate path. This should be obvious...