I use Apple's Vision framework to create a matte image for a person found in a user-provided image. I want to analyze the result of a VNGeneratePersonSegmentationRequest to understand, for example, whether the request has found a person at all and, if so, how large the resulting mask is relative to the source image (either the mask's extent or the number of opaque pixels).
The result of a VNGeneratePersonSegmentationRequest is a VNPixelBufferObservation, and apparently it provides neither a meaningful confidence (confidence is always 1.0) nor a result count like numberOfFoundPeople.
What I'm trying instead is to analyze the resulting CVPixelBuffer directly. I obtain it like this:
import Vision

// Run the person segmentation request and ask for an 8-bit, single-channel mask.
let personSegmentationRequest = VNGeneratePersonSegmentationRequest()
personSegmentationRequest.outputPixelFormat = kCVPixelFormatType_OneComponent8

let requestHandler = VNImageRequestHandler(url: imageUrl)
try requestHandler.perform([personSegmentationRequest])

// The mask is delivered as a VNPixelBufferObservation.
let mask = personSegmentationRequest.results![0]
let maskBuffer = mask.pixelBuffer

// Lock the buffer before reading its bytes.
CVPixelBufferLockBaseAddress(maskBuffer, .readOnly)
defer {
    CVPixelBufferUnlockBaseAddress(maskBuffer, .readOnly)
}
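As mentioned above, the observation itself doesn't tell me anything useful; a quick check like this always prints a confidence of 1.0, whether or not the image contains a person:
for observation in personSegmentationRequest.results ?? [] {
    // Always 1.0 in my tests, person or no person.
    print(observation.confidence)
}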
My idea now is to look at the individual pixel values of the buffer. I assumed that I could get the mask's size with CVPixelBufferGetWidth and CVPixelBufferGetHeight, and that there is one byte per pixel, where a value of 0 means "fully transparent" and 255 means "fully opaque".
Apparently, that's not correct: the pixel buffer always has a size of 2016x1512 or 1512x2016, but CVPixelBufferGetBytesPerRow returns either 2048 or 1536, so there are some extra bytes per row. How does this add up? CVPixelBufferGetExtendedPixels returns 0 for all directions, so there is no padding.
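If the extra bytes were some kind of row alignment, I would expect a pixel at (x, y) to be addressed like this (a sketch based on that assumption; maskValue is just a hypothetical helper, not API):
// Sketch: assumes each row occupies bytesPerRow bytes and only the first
// `width` bytes of a row contain actual mask values. The buffer must
// already be locked via CVPixelBufferLockBaseAddress.
func maskValue(x: Int, y: Int, in buffer: CVPixelBuffer) -> UInt8 {
    let bytesPerRow = CVPixelBufferGetBytesPerRow(buffer)
    let base = CVPixelBufferGetBaseAddress(buffer)!
        .assumingMemoryBound(to: UInt8.self)
    return base[y * bytesPerRow + x]
}
Even then, the first bytes of the buffer all lie in row 0, so a row stride alone wouldn't explain the values below.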
Also, if I look at the first few values in the buffer, they're not what I expect. Here's my code to print the first bytes of the buffer:
// Print the first bytes of the (already locked) buffer as one comma-separated line.
let baseAddress = CVPixelBufferGetBaseAddress(maskBuffer)!
let pointer = baseAddress.assumingMemoryBound(to: UInt8.self)
print((0...10).map({ String(pointer[$0]) }).joined(separator: ","))
Here are example outputs for images that contain either a person in the center of the image or no person at all:
1,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0
9,4,1,1,1,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0
These values should correspond to the pixels in one corner of the source image, and I would always expect all zeroes for my example images.
What's strange is that when I ignore these results and simply create a CIImage from this CVPixelBuffer, rescale it, and apply it as a mask with Core Image (as in Apple's example code), the result looks correct, and I do not see semi-transparent pixels in the corners.
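For reference, the masking step looks roughly like this (a simplified sketch of the Core Image approach, not my exact code; Apple's sample may differ in details):
import CoreImage
import CoreImage.CIFilterBuiltins

// Scale the mask to the input image's size and use it to blend the
// person over a transparent background.
let inputImage = CIImage(contentsOf: imageUrl)!
var maskImage = CIImage(cvPixelBuffer: maskBuffer)

let scaleX = inputImage.extent.width / maskImage.extent.width
let scaleY = inputImage.extent.height / maskImage.extent.height
maskImage = maskImage.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

let blend = CIFilter.blendWithMask()
blend.inputImage = inputImage
blend.backgroundImage = CIImage.empty()
blend.maskImage = maskImage
let maskedImage = blend.outputImage   // looks correct, no semi-transparent corners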
What's going on? Do I misunderstand the CVPixelBufferGet* functions? Does the data in the pixel buffer contain metadata?
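For context, what I ultimately want to compute is something like the following, once I understand the buffer layout (a sketch that assumes one byte per pixel, rows of bytesPerRow bytes, 255 meaning "fully opaque", and an arbitrary threshold; opaqueCoverage is a hypothetical helper):
// Fraction of mask pixels that are (nearly) opaque, relative to the
// total number of pixels. Assumes the buffer is locked and laid out
// as described above.
func opaqueCoverage(of buffer: CVPixelBuffer, threshold: UInt8 = 128) -> Double {
    let width = CVPixelBufferGetWidth(buffer)
    let height = CVPixelBufferGetHeight(buffer)
    let bytesPerRow = CVPixelBufferGetBytesPerRow(buffer)
    let base = CVPixelBufferGetBaseAddress(buffer)!
        .assumingMemoryBound(to: UInt8.self)

    var opaqueCount = 0
    for y in 0..<height {
        let row = base + y * bytesPerRow
        for x in 0..<width where row[x] >= threshold {
            opaqueCount += 1
        }
    }
    return Double(opaqueCount) / Double(width * height)
}
A result of 0 would then tell me that no person was found at all.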