You can implement your own simple reductions using the KNC vpermd and vpermf32x4 instructions, as well as the swizzle modifiers, to do cross-lane operations inside the vector units.
The C intrinsic equivalents of these are the `_mm512_{mask_}permute*` and `_mm512_{mask_}swizzle*` families.
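To make the idea concrete, here is a rough, portable sketch of the log-step pattern that such a hand-written reduction follows. It is plain scalar C, not KNC intrinsics: the array `lane` stands in for one 512-bit register of 16 floats, and each pass of the outer loop stands in for one permute or swizzle followed by one vector add. The name `reduce_add_16` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* a 512-bit register holds 16 32-bit floats */

/* Hypothetical helper: sums 16 lanes in log2(16) = 4 steps.
   Scalar emulation only; it does not use any vector intrinsics. */
float reduce_add_16(const float *v)
{
    float lane[LANES];
    size_t i, stride;

    assert(v != NULL);

    /* Copy the input so the caller's data is left untouched. */
    for (i = 0; i < LANES; ++i)
        lane[i] = v[i];

    /* Each pass folds the upper half of the active lanes onto the
       lower half: lane i receives lane i + stride and adds it. */
    for (stride = LANES / 2; stride > 0; stride /= 2)
        for (i = 0; i < stride; ++i)
            lane[i] += lane[i + stride];

    /* After the last pass the total sits in lane 0. */
    return lane[0];
}
```

On the real hardware, I would expect the strides of 8 and 4 to map to permutes across the 128-bit lanes, and the strides of 2 and 1 to map to swizzles within a 128-bit lane, but treat that mapping as an assumption and check it against the intrinsics documentation.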
However, I recommend that you first look at the array notation reduce operations, which already have high-performance implementations on the MIC.
See the list of available reduction operations here, and also check out this video by Taylor Kidd from Intel, which covers array notation reductions on the Xeon Phi starting at 20 min 30 s.
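As a minimal sketch of what an array notation reduction looks like, the function below sums an array with `__sec_reduce_add` when the compiler supports Cilk Plus, and falls back to a plain loop otherwise. The name `sum_array` is made up for illustration, and I am assuming the compiler defines the `__cilk` macro when Cilk Plus is enabled; adjust the guard if yours does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the sum of a[0] .. a[n-1]. */
float sum_array(const float *a, size_t n)
{
    assert(a != NULL || n == 0);

#ifdef __cilk
    /* Cilk Plus array notation: a[0:n] is the section that starts at
       index 0 and has length n. The compiler picks the vectorized
       reduction for the target. */
    return __sec_reduce_add(a[0:n]);
#else
    /* Fallback for compilers without Cilk Plus (assumption: __cilk is
       not defined there). Same result, written as an ordinary loop. */
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += a[i];
    return sum;
#endif
}
```

The same source should build for both the Xeon Phi and a regular Xeon, which is the main reason to prefer this over hand-written permutes.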
EDIT: I noticed you are also looking for CPU-based solutions. The array notation reductions will work very well on the Xeon as well.