
I use Apple's Vision framework to create a matte image for a person found in a user-provided image. I want to analyze the results of a VNGeneratePersonSegmentationRequest to understand, for example, whether the request has found a person at all and, if so, how large the resulting mask is relative to the source image (either the mask's extent or the number of opaque pixels).

The result of a VNGeneratePersonSegmentationRequest is a VNPixelBufferObservation, which apparently supports neither a meaningful confidence level (confidence is always 1.0) nor a result count such as numberOfFoundPeople.

Instead, I try to analyze the resulting CVPixelBuffer directly. I obtain it like this:

import Vision

let personSegmentationRequest = VNGeneratePersonSegmentationRequest()
// Request a single-channel 8-bit mask: one byte per pixel.
personSegmentationRequest.outputPixelFormat = kCVPixelFormatType_OneComponent8

let requestHandler = VNImageRequestHandler(url: imageUrl)
try requestHandler.perform([personSegmentationRequest])

// results is [VNPixelBufferObservation]; force-unwrapping for brevity.
let mask = personSegmentationRequest.results![0]
let maskBuffer = mask.pixelBuffer
CVPixelBufferLockBaseAddress(maskBuffer, .readOnly)
defer {
  CVPixelBufferUnlockBaseAddress(maskBuffer, .readOnly)
}

My idea now is to look at the individual pixel values of the buffer. I assumed that I could get the mask's size with CVPixelBufferGetWidth and CVPixelBufferGetHeight, and read one byte per pixel, where a value of 0 means "fully transparent" and 255 means "fully opaque".

Apparently, that's not correct: the pixel buffer always has a size of 2016x1512 or 1512x2016, but CVPixelBufferGetBytesPerRow returns either 2048 or 1536, so there are extra bytes per row. How does this add up? CVPixelBufferGetExtendedPixels returns 0 in all directions, so there is no padding.
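For reference, these are the calls I use to read that geometry (a minimal sketch; maskBuffer is the buffer obtained above):

import CoreVideo

let width = CVPixelBufferGetWidth(maskBuffer)             // 2016 or 1512
let height = CVPixelBufferGetHeight(maskBuffer)           // 1512 or 2016
let bytesPerRow = CVPixelBufferGetBytesPerRow(maskBuffer) // 2048 or 1536

// Extended pixels come back as 0 in all directions.
var left = 0, right = 0, top = 0, bottom = 0
CVPixelBufferGetExtendedPixels(maskBuffer, &left, &right, &top, &bottom)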

Also, if I look at the first few values in the buffer, they're not what I expect. Here's my code to print the first 11 values of the buffer:

let baseAddress = CVPixelBufferGetBaseAddress(maskBuffer)!
let pointer = baseAddress.assumingMemoryBound(to: UInt8.self)
print((0...10).map { String(pointer[$0]) }.joined(separator: ","))

Here are example outputs for images that contain either a person in the center of the image or no person at all:

1,0,0,0,0,0,0,0,0,0,0

4,1,0,0,0,0,0,0,0,0,0

9,4,1,1,1,0,0,0,0,0,0

2,1,1,1,0,0,0,0,0,0,0

0,0,0,0,0,0,0,0,0,0,0

The values should correspond to the pixels at a corner of the source image, and I would always expect all zeroes for my example images.

What's strange is that when I ignore these values and simply create a CIImage from this CVPixelBuffer, rescale it, and apply it as a mask using Core Image (as in Apple's example code), the result looks correct, and I do not see semi-transparent pixels in the corners.
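This is roughly the masking step, a minimal sketch along the lines of Apple's sample code (sourceImage, matte, and the scaling are my own names and assumptions):

import CoreImage
import CoreImage.CIFilterBuiltins

let sourceImage = CIImage(contentsOf: imageUrl)!
var maskImage = CIImage(cvPixelBuffer: maskBuffer)

// Scale the mask up to the source image's dimensions.
let scaleX = sourceImage.extent.width / maskImage.extent.width
let scaleY = sourceImage.extent.height / maskImage.extent.height
maskImage = maskImage.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

// Keep the person, make everything else transparent.
let filter = CIFilter.blendWithMask()
filter.inputImage = sourceImage
filter.backgroundImage = CIImage.empty()
filter.maskImage = maskImage
let matte = filter.outputImage!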

What's going on? Do I misunderstand the CVPixelBufferGet* methods? Does the data in the pixel buffer contain metadata?

Theo

1 Answer


After some more research, I assume that the difference between the number of bytes per row and the pixel buffer's width arises from a required byte alignment in Core Video (see this answer). Each row is padded up to a multiple of the alignment; the observed values are consistent with 64-byte alignment (2016 rounds up to 2048, 1512 to 1536). So the byte for pixel (x, y) sits at offset y * bytesPerRow + x, not y * width + x, and the extra bytes at the end of each row are just padding to be skipped.
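Under that assumption, indexing has to use the row stride; a minimal sketch (maskBuffer is the locked buffer from the question):

let width = CVPixelBufferGetWidth(maskBuffer)
let height = CVPixelBufferGetHeight(maskBuffer)
let bytesPerRow = CVPixelBufferGetBytesPerRow(maskBuffer)
let pointer = CVPixelBufferGetBaseAddress(maskBuffer)!
    .assumingMemoryBound(to: UInt8.self)

// The mask value at (x, y); bytesPerRow, not width, is the row stride.
func maskValue(x: Int, y: Int) -> UInt8 {
    pointer[y * bytesPerRow + x]
}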

The small nonzero values are in fact visible when the buffer is applied as an image mask, but they are small enough that the impact is barely noticeable. I assume they are artifacts of the ML model and can be ignored.
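That also suggests an answer to the original question: whether a person was found, and how large the mask is, can be approximated by counting pixels above a small threshold. A sketch, where the cutoff of 32 is an arbitrary choice of mine:

// Fraction of mask pixels above `threshold`, skipping the row padding.
// A result of 0 (or very close to it) suggests no person was found.
func opaqueFraction(of buffer: CVPixelBuffer, threshold: UInt8 = 32) -> Double {
    CVPixelBufferLockBaseAddress(buffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(buffer, .readOnly) }

    let width = CVPixelBufferGetWidth(buffer)
    let height = CVPixelBufferGetHeight(buffer)
    let bytesPerRow = CVPixelBufferGetBytesPerRow(buffer)
    let pointer = CVPixelBufferGetBaseAddress(buffer)!
        .assumingMemoryBound(to: UInt8.self)

    var opaque = 0
    for y in 0..<height {
        let row = pointer + y * bytesPerRow
        for x in 0..<width where row[x] > threshold {
            opaque += 1
        }
    }
    return Double(opaque) / Double(width * height)
}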

Theo