I'm trying to use Apple's Vision framework to detect face bounding boxes using ARKit's front-facing camera in the device's portrait mode (ARFaceTrackingConfiguration).
It's unclear to me whether the reported face bounding boxes are in the coordinate space of the original, un-rotated buffer as captured by the camera sensor, or of the rotated buffer that the Vision framework operates on.
To account for the fact that the camera sensor is mounted landscape on iOS devices, I pass in the raw frame provided by ARKit (which is rotated 90° CCW) along with .right when calling VNSequenceRequestHandler.perform(_:on:orientation:):
private let requestHandler = VNSequenceRequestHandler()
private var facePoseRequest: VNDetectFaceRectanglesRequest!
// ...
// `orientation` evaluates to .right based on UIDevice's orientation (see below)
try? self.requestHandler.perform([self.facePoseRequest], on: currentBuffer, orientation: orientation)
guard let faceRes = self.facePoseRequest.results?.first as? VNFaceObservation else {
    return
}
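For reference, this is roughly how I derive orientation. This is my own sketch (the helper name and the landscape/upside-down cases are my assumptions, not from Apple's docs); for portrait, the front camera's 90° CCW-rotated buffer needs .right to become upright:

```swift
import UIKit
import ImageIO

// Hypothetical helper: maps the current device orientation to the EXIF
// orientation Vision should apply to the front camera's landscape-native
// buffer before inference. Only the .portrait case is verified in my setup;
// the other cases are my best guess.
func currentCGOrientation() -> CGImagePropertyOrientation {
    switch UIDevice.current.orientation {
    case .portrait:           return .right
    case .portraitUpsideDown: return .left
    case .landscapeLeft:      return .up
    case .landscapeRight:     return .down
    default:                  return .right // fall back to portrait
    }
}
```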
Since I passed .right for the orientation parameter, I'm assuming the Vision framework rotates the image 90° CW to make it upright before performing inference.
Question: Is faceRes.boundingBox in the coordinate system of the original, un-rotated pixel buffer, or in the coordinate system of what Vision's internal model sees (i.e. the orientation-fixed/rotated buffer)?
Initially I assumed it to be in the rotated buffer's space, but I'm seeing that the reported normalized bounding box has width > height. When I convert the reported box to image space using VNImageRectForNormalizedRect, the resulting box is a square, i.e. width == height. This could be because the box's normalized width/height still follow the aspect ratio of the raw frame.
//Assumption-1: Evaluate BB assuming it is reported in "rotated" image's coord system
let flippedImg = CIImage(cvPixelBuffer: currentBuffer).oriented(.right)
let imgBB = VNImageRectForNormalizedRect(faceRes.boundingBox, Int(flippedImg.extent.width), Int(flippedImg.extent.height))
//draw "imgBB" on the "flippedImg" above
I then tried to draw it on the un-rotated raw frame reported by ARKit's callback. While the aspect ratio of the box appears right (height > width), the box jumps around when I tilt the device or my head, and aligns with the actual face only at one particular angle.
//Assumption-2: Evaluate BB assuming it is reported in "unrotated" image's coord system
let rawImg = CIImage(cvPixelBuffer: currentBuffer)
let imgBB = VNImageRectForNormalizedRect(faceRes.boundingBox, Int(rawImg.extent.width), Int(rawImg.extent.height))
//draw "imgBB" on "rawImg" above
Note-1: Interestingly, this jumping around doesn't happen with the previous approach/assumption: there, the box drawn on the image always sticks to the face, but it's square in shape and doesn't cover the entire face (only from below the eyes to the chin).
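In case it's relevant, this is the kind of remapping I'd expect to need if the box really is reported in the rotated (upright) image's space and I want it in the raw buffer's space. The math is my own derivation (treating the upright image as the raw buffer rotated 90° CW, with Vision's lower-left-origin normalized coordinates), so it may not match what Vision actually does:

```swift
import CoreGraphics

// Sketch (my own math, not from Apple's docs): given a normalized rect in
// the upright image, rotate it back 90° CCW into the raw buffer's
// normalized space. Corner-mapping (x, y) -> (1 - y, x) gives:
func rotatedBackToRaw(_ bb: CGRect) -> CGRect {
    CGRect(x: 1.0 - bb.origin.y - bb.height,
           y: bb.origin.x,
           width: bb.height,   // width and height swap under the rotation
           height: bb.width)
}

// Then scale into raw-buffer pixels as usual:
// let rawBB = VNImageRectForNormalizedRect(rotatedBackToRaw(faceRes.boundingBox),
//                                          Int(rawImg.extent.width),
//                                          Int(rawImg.extent.height))
```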
Note-2: I also tried to account for the Y-flipped-ness of the Vision framework (its normalized rects use a lower-left origin) by flipping the rect before passing it to VNImageRectForNormalizedRect:
let flippedNormBB = CGRect(x: faceRes.boundingBox.origin.x,
                           // flip about the horizontal axis; the rect's
                           // height must be subtracted too, since the origin
                           // is the rect's lower-left corner
                           y: 1.0 - faceRes.boundingBox.origin.y - faceRes.boundingBox.height,
                           width: faceRes.boundingBox.width,
                           height: faceRes.boundingBox.height)
But this doesn't seem to help either.
Thanks in advance for any pointers that clear up this confusion.