After trying a few things on paper I've come up with something that might work for you. It is a properly parallelized/vectorized implementation of the function in SSE.
It does however require data reorganization because we'll do parallel processing for 4 triangles at once.
I will break it up in steps and use instruction names here and there, but please use the C intrinsics (_mm_load_ps(), _mm_sub_ps() et al, they're in xmmintrin.h in VC) -- when I speak of registers this just means __m128.
STEP 1.
We do not need the Y coordinate at all, so we set up pointers to pairs of X and Z. Supply at least 4 pairs (i.e. 4 triangles total) per call. I'll call each X and Z pair a vertex.
STEP 2.
Use MOVAPS (requires the pointers to be aligned to 16-bit) to load the first two vertices pointed to by each pointer into registers.
A register loaded from a will look like this: [ a0.x, a0.z, a1.x, a1.z ]
STEP 3.
Now, using a single subtract instruction, you can calculate the deltas (your v0, v1, v2) for 2 vertices at once.
Calculate v0, v1 and v2 not only for the first 2 triangles, but also for the latter 2!
As I said you should supply a total of 4 vertices, or 8 floats, per input. Just repeat steps 2 and 3 for that data.
Now we have 2 pairs of vx registers, each pair containing the result for 2 triangles. I'll refer to them as vx_0 (first pair) and vx_1 (second pair) where x is from 0 to 2.
STEP 4.
Dot products. In order to parallelize the barycentric calculation (later on) we require the result of each dot product for each of the 4 triangles, in 1 single register.
So where you'd calculate dot01 for example, we will do the same, but for 4 triangles at once. Each v-register contains the result for 2 vectors, so we begin by multiplying them.
Let us say that u and v -- parameters in your dot product function -- are now v0_0 and v1_0 (as to calculate dot01):
Multiply u and v to get: [ (v0_0.x0) * (v1_0.x0), (v0_0.z0) * (v1_0.z0), (v0_0.x1) * (v1_0.x1), (v0_0.z1) * (v1_0.z1) ]
This may look confusing because of .x0 / .z0 and .x1 / .z1, but look at what was loaded at step 2: a0, a1.
If by now this still feels fuzzy, pick up a piece of paper and write along from the start.
Next, still for the same dot product, do the multiplication for v0_1 and v1_1 (i.e. the second pair of triangles).
Now we have 2 registers with 2 X & Z pairs each (4 vertices total), multiplied and ready to be added together to form 4 separate dot products. SSE3 has an instruction to do exactly this, and it's called HADDPS:
xmm0 = [A, B, C, D]
xmm1 = [E, F, G, H]
HADDPS xmm0, xmm1 does this:
xmm0 = [A+B, C+D, E+F, G+H]
It will take the X & Z pairs from our first register, those from the second, add them together and store them in the first, second, third and fourth component of the destination register. Ergo: at this point we've got this particular dot product for all 4 triangles!
**Now repeat this process for all the dot products: dot00 et cetera. **
STEP 5.
The last calculation (as far as I could see by the supplied code) is the barycentric stuff. This is a 100% scalar calculation in your code. But your inputs now are not scalar dot product results (i.e. single floats), they are SSE vectors/registers with a dot product for each of the 4 triangles.
So if you flat out vectorize this by using the parallel SSE operations that operate on all 4 floats, you'll eventually end up with 1 register (or result) carrying 4 heights, 1 for each triangle.
Since my lunch break is well past due I'm not going to spell this one out, but given the setup/idea I've given this is a last step and shouldn't be hard to figure out.
I know that this idea is a bit of a stretch and requires some loving from the code that sits above it and perhaps some quality time with pencil and paper, but it will be fast (and you can even add OpenMP afterwards if you'd like).
Good luck :)
(and forgive my fuzzy explanation, I can whip up the function if need be =))
UPDATE
I've written an implementation and it did not go as I expected, mainly because the Y component got involved beyond the piece of code that you initially pasted in your question ( I looked it up). What I've done here is not just ask you to rearrange all points in to XZ pairs and feed them per 4, but also feed 3 pointers (for points A, B and C) with the Y values for each of the 4 triangles. From a local perspective this is fastest. I can still modify it to require less intrusive changes from the callee's end, but please let me know what's desirable.
Then a disclaimer: this code is straightforward as hell (something I've found to work pretty well with compilers in terms of SSE... they can reorganize as see fit and x86/x64 CPUs take their share in it too). Also the naming, it's not my style, it's not anyone's, just do with it what you see fit.
Hope it helps and if not I'll gladly go over it again. And if this is a commercial project there's also the option of getting me on board I guess ;)
Anyway, I've put it on pastebin: http://pastebin.com/20u8fMEb