How can I use the new Vision framework in iOS 11 to track eyes in a video while the head or the camera is moving? (I'm using the front camera.)
I've found VNDetectFaceLandmarksRequest to be very slow on my iPad - landmark requests complete roughly once every 1-2 seconds. I feel like I'm doing something wrong, but there isn't much documentation on Apple's site.
I've already watched the WWDC 2017 video on Vision:
https://developer.apple.com/videos/play/wwdc2017/506/
as well as read this guide:
https://github.com/jeffreybergier/Blog-Getting-Started-with-Vision
My code looks roughly like this right now (sorry, it's Objective-C):
// Capture session setup
- (BOOL)setUpCaptureSession {
    AVCaptureDevice *captureDevice = [AVCaptureDevice
        defaultDeviceWithDeviceType:AVCaptureDeviceTypeBuiltInWideAngleCamera
                          mediaType:AVMediaTypeVideo
                           position:AVCaptureDevicePositionFront];
    NSError *error;
    AVCaptureDeviceInput *captureInput = [AVCaptureDeviceInput deviceInputWithDevice:captureDevice error:&error];
    // Check the returned object instead of the (possibly uninitialized) error.
    if (captureInput == nil) {
        NSLog(@"Failed to initialize video input: %@", error);
        return NO;
    }
    self.captureOutputQueue = dispatch_queue_create("CaptureOutputQueue", DISPATCH_QUEUE_SERIAL);
    AVCaptureVideoDataOutput *captureOutput = [[AVCaptureVideoDataOutput alloc] init];
    captureOutput.alwaysDiscardsLateVideoFrames = YES;
    [captureOutput setSampleBufferDelegate:self queue:self.captureOutputQueue];
    self.captureSession = [[AVCaptureSession alloc] init];
    self.captureSession.sessionPreset = AVCaptureSessionPreset1280x720;
    [self.captureSession addInput:captureInput];
    [self.captureSession addOutput:captureOutput];
    return YES;
}
// Capture output delegate:
- (void)captureOutput:(AVCaptureOutput *)output
    didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
           fromConnection:(AVCaptureConnection *)connection {
    if (!self.detectionStarted) {
        return;
    }
    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    if (pixelBuffer == NULL) {
        return;
    }
    NSMutableDictionary<VNImageOption, id> *requestOptions = [NSMutableDictionary dictionary];
    CFTypeRef cameraIntrinsicData = CMGetAttachment(sampleBuffer,
                                                    kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                                    nil);
    if (cameraIntrinsicData != NULL) {
        requestOptions[VNImageOptionCameraIntrinsics] = (__bridge id)(cameraIntrinsicData);
    }
    // TODO: Detect device orientation
    static const CGImagePropertyOrientation orientation = kCGImagePropertyOrientationRight;
    VNDetectFaceLandmarksRequest *landmarksRequest =
        [[VNDetectFaceLandmarksRequest alloc] initWithCompletionHandler:^(VNRequest *request, NSError *error) {
            if (error != nil) {
                NSLog(@"Error while detecting face landmarks: %@", error);
            } else {
                dispatch_async(dispatch_get_main_queue(), ^{
                    // Draw eyes in two corresponding CAShapeLayers
                });
            }
        }];
    VNImageRequestHandler *requestHandler = [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer
                                                                                     orientation:orientation
                                                                                         options:requestOptions];
    NSError *error;
    if (![requestHandler performRequests:@[landmarksRequest] error:&error]) {
        NSLog(@"Error performing landmarks request: %@", error);
        return;
    }
}
Is it right to call -performRequests:error: on the same queue as the video output? Based on my experiments, this method seems to call the request's completion handler synchronously. Should I avoid calling it on every frame?
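One idea I've been considering (just a sketch of my own, not something taken from Apple's docs): run the Vision work on its own serial queue and drop incoming frames while a request is still in flight, so the capture queue is never blocked. Here visionQueue and landmarksRequestInFlight (an atomic BOOL) are hypothetical properties I would add to the class, and self.landmarksRequest is a single reused VNDetectFaceLandmarksRequest:
- (void)captureOutput:(AVCaptureOutput *)output
    didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
           fromConnection:(AVCaptureConnection *)connection {
    if (self.landmarksRequestInFlight) {
        return; // The previous frame is still being analyzed; skip this one.
    }
    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    if (pixelBuffer == NULL) {
        return;
    }
    self.landmarksRequestInFlight = YES;
    CFRetain(pixelBuffer); // Keep the buffer alive until Vision is done with it.
    dispatch_async(self.visionQueue, ^{
        VNImageRequestHandler *handler =
            [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer
                                                     orientation:kCGImagePropertyOrientationRight
                                                         options:@{}];
        NSError *requestError;
        [handler performRequests:@[self.landmarksRequest] error:&requestError];
        CFRelease(pixelBuffer);
        self.landmarksRequestInFlight = NO;
    });
}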
To speed things up I've also tried using VNTrackObjectRequest to track each eye separately after the landmarks were detected (by constructing a bounding box from the landmark region's points), but that didn't work very well (I'm still trying to figure out why).
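For reference, this is roughly what I tried (just a sketch; eyeBoundingBox is a normalized CGRect I build from the eye landmark region's points, and self.sequenceHandler is a VNSequenceRequestHandler I keep alive across frames):
VNDetectedObjectObservation *eyeObservation =
    [VNDetectedObjectObservation observationWithBoundingBox:eyeBoundingBox];
VNTrackObjectRequest *trackRequest =
    [[VNTrackObjectRequest alloc] initWithDetectedObjectObservation:eyeObservation];
NSError *trackError;
[self.sequenceHandler performRequests:@[trackRequest]
                      onCVPixelBuffer:pixelBuffer
                                error:&trackError];
// On subsequent frames I update trackRequest.inputObservation with the
// observation returned for the previous frame.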
What is the best strategy for tracking eyes in a video? Should I track a face rectangle first and then run the landmarks request only inside its area (would that be faster)?
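For example, would something like this be the intended pattern? (again just a sketch; self.lastFaceObservation is a hypothetical property holding the most recent VNFaceObservation from a VNDetectFaceRectanglesRequest or from a tracking request):
VNDetectFaceLandmarksRequest *landmarksRequest = [[VNDetectFaceLandmarksRequest alloc] init];
// Constrain the landmarks search to the face rectangle I already have,
// instead of letting Vision re-detect the face in the whole frame.
landmarksRequest.inputFaceObservations = @[self.lastFaceObservation];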