
How can I use the new Vision framework in iOS 11 to track eyes in video from the front camera while the head or the camera is moving?

I've found VNDetectFaceLandmarksRequest to be very slow on my iPad - a landmarks request completes only about once every 1-2 seconds. I feel like I'm doing something wrong, but there isn't much documentation on Apple's site.

I've already watched the WWDC 2017 video on Vision:

https://developer.apple.com/videos/play/wwdc2017/506/

as well as read this guide:

https://github.com/jeffreybergier/Blog-Getting-Started-with-Vision

My code looks roughly like this right now (sorry, it's Objective-C):

// Capture session setup

- (BOOL)setUpCaptureSession {
    AVCaptureDevice *captureDevice = [AVCaptureDevice
                                      defaultDeviceWithDeviceType:AVCaptureDeviceTypeBuiltInWideAngleCamera
                                      mediaType:AVMediaTypeVideo
                                      position:AVCaptureDevicePositionFront];
    NSError *error;
    AVCaptureDeviceInput *captureInput = [AVCaptureDeviceInput deviceInputWithDevice:captureDevice error:&error];
    if (error != nil) {
        NSLog(@"Failed to initialize video input: %@", error);
        return NO;
    }

    self.captureOutputQueue = dispatch_queue_create("CaptureOutputQueue",
                                                    DISPATCH_QUEUE_SERIAL);

    AVCaptureVideoDataOutput *captureOutput = [[AVCaptureVideoDataOutput alloc] init];
    captureOutput.alwaysDiscardsLateVideoFrames = YES;
    [captureOutput setSampleBufferDelegate:self queue:self.captureOutputQueue];

    self.captureSession = [[AVCaptureSession alloc] init];
    self.captureSession.sessionPreset = AVCaptureSessionPreset1280x720;
    [self.captureSession addInput:captureInput];
    [self.captureSession addOutput:captureOutput];

    return YES;
}

// Capture output delegate:

- (void)captureOutput:(AVCaptureOutput *)output
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection {
    if (!self.detectionStarted) {
        return;
    }

    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    if (pixelBuffer == nil) {
        return;
    }

    NSMutableDictionary<VNImageOption, id> *requestOptions = [NSMutableDictionary dictionary];
    CFTypeRef cameraIntrinsicData = CMGetAttachment(sampleBuffer,
                                                    kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                                    NULL);
    if (cameraIntrinsicData != NULL) {
        // Pass the camera intrinsics along so Vision can account for lens geometry.
        requestOptions[VNImageOptionCameraIntrinsics] = (__bridge id)cameraIntrinsicData;
    }

    // TODO: Detect device orientation
    static const CGImagePropertyOrientation orientation = kCGImagePropertyOrientationRight;

    VNDetectFaceLandmarksRequest *landmarksRequest =
        [[VNDetectFaceLandmarksRequest alloc] initWithCompletionHandler:^(VNRequest *request, NSError *error) {
            if (error != nil) {
                NSLog(@"Error while detecting face landmarks: %@", error);
            } else {
                dispatch_async(dispatch_get_main_queue(), ^{
                    // Draw eyes in two corresponding CAShapeLayers
                });
            }
        }];


    VNImageRequestHandler *requestHandler = [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer
                                                                                     orientation:orientation
                                                                                         options:requestOptions];
    NSError *error;
    if (![requestHandler performRequests:@[landmarksRequest] error:&error]) {
        NSLog(@"Error performing landmarks request: %@", error);
        return;
    }
}

Is it right to call -performRequests:error: on the same queue as the video output? Based on my experiments, this method seems to call the request's completion handler synchronously. Should I avoid calling it on every frame?
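
One pattern I'm considering (just a sketch of my own, not something from the docs) is to hop off the capture queue onto a dedicated serial queue and simply drop frames that arrive while a request is still running, instead of blocking the camera. Here, visionQueue and landmarksRequestInFlight (declared atomic, for simplicity) are hypothetical properties added only for illustration:

- (void)captureOutput:(AVCaptureOutput *)output
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection {
    if (!self.detectionStarted || self.landmarksRequestInFlight) {
        return; // A previous frame is still being analyzed; drop this one.
    }

    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    if (pixelBuffer == NULL) {
        return;
    }

    self.landmarksRequestInFlight = YES;
    CFRetain(pixelBuffer); // Keep the buffer alive beyond this delegate callback.

    dispatch_async(self.visionQueue, ^{
        VNDetectFaceLandmarksRequest *landmarksRequest = [[VNDetectFaceLandmarksRequest alloc] init];
        VNImageRequestHandler *requestHandler =
            [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer
                                                     orientation:kCGImagePropertyOrientationRight
                                                         options:@{}];
        NSError *error;
        [requestHandler performRequests:@[landmarksRequest] error:&error];
        CFRelease(pixelBuffer);
        self.landmarksRequestInFlight = NO;

        // landmarksRequest.results contains VNFaceObservation objects (if any);
        // hop to the main queue to update the eye CAShapeLayers from them.
    });
}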

To speed things up I've also tried using VNTrackObjectRequest to track each eye separately after landmarks were detected in the video (by constructing a bounding box from the landmarks' region points), but that didn't work very well (I'm still trying to figure out why).
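
To show what I mean, this is roughly how I'm seeding the tracker from one eye region (a sketch only; sequenceHandler and leftEyeTracker are properties I added, and the conversion assumes the region points are normalized to the face bounding box):

// Build a tracking request from an eye's landmark region. The region points are
// normalized to the face bounding box, so they are mapped into image-normalized
// coordinates, which is what VNDetectedObjectObservation expects.
- (VNTrackObjectRequest *)trackerForEyeRegion:(VNFaceLandmarkRegion2D *)eyeRegion
                                       inFace:(VNFaceObservation *)face {
    CGFloat minX = 1.0, minY = 1.0, maxX = 0.0, maxY = 0.0;
    const CGPoint *points = eyeRegion.normalizedPoints;
    for (NSUInteger i = 0; i < eyeRegion.pointCount; i++) {
        minX = MIN(minX, points[i].x); maxX = MAX(maxX, points[i].x);
        minY = MIN(minY, points[i].y); maxY = MAX(maxY, points[i].y);
    }
    CGRect faceBox = face.boundingBox;
    CGRect eyeBox = CGRectMake(faceBox.origin.x + minX * faceBox.size.width,
                               faceBox.origin.y + minY * faceBox.size.height,
                               (maxX - minX) * faceBox.size.width,
                               (maxY - minY) * faceBox.size.height);

    VNDetectedObjectObservation *seed =
        [VNDetectedObjectObservation observationWithBoundingBox:eyeBox];
    VNTrackObjectRequest *tracker =
        [[VNTrackObjectRequest alloc] initWithDetectedObjectObservation:seed];
    tracker.trackingLevel = VNRequestTrackingLevelAccurate;
    return tracker;
}

// Per frame, reusing a single VNSequenceRequestHandler across the whole sequence.
- (void)trackLeftEyeInPixelBuffer:(CVPixelBufferRef)pixelBuffer {
    NSError *error;
    [self.sequenceHandler performRequests:@[self.leftEyeTracker]
                          onCVPixelBuffer:pixelBuffer
                              orientation:kCGImagePropertyOrientationRight
                                    error:&error];

    VNDetectedObjectObservation *latest =
        (VNDetectedObjectObservation *)self.leftEyeTracker.results.firstObject;
    if (latest != nil) {
        // Feed the newest observation back in so tracking continues from it.
        self.leftEyeTracker.inputObservation = latest;
    }
}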

What is the best strategy for tracking eyes in a video? Should I track the face rectangle and then run a landmarks request only inside its area (would that be faster)?
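
What I had in mind for that second option is something like this (again just a sketch, assuming lastFaceBoundingBox is a hypothetical property holding the most recent normalized face rectangle, obtained from VNDetectFaceRectanglesRequest or a tracker):

- (void)detectLandmarksInPixelBuffer:(CVPixelBufferRef)pixelBuffer {
    VNDetectFaceLandmarksRequest *landmarksRequest = [[VNDetectFaceLandmarksRequest alloc] init];

    // Constrain the landmarks search to the face rectangle we already know,
    // so Vision can skip the full-frame face detection pass.
    VNFaceObservation *knownFace =
        [VNFaceObservation observationWithBoundingBox:self.lastFaceBoundingBox];
    landmarksRequest.inputFaceObservations = @[knownFace];

    VNImageRequestHandler *requestHandler =
        [[VNImageRequestHandler alloc] initWithCVPixelBuffer:pixelBuffer
                                                 orientation:kCGImagePropertyOrientationRight
                                                     options:@{}];
    NSError *error;
    if ([requestHandler performRequests:@[landmarksRequest] error:&error]) {
        VNFaceObservation *face = (VNFaceObservation *)landmarksRequest.results.firstObject;
        // face.landmarks.leftEye and face.landmarks.rightEye are the regions to draw.
    }
}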

  • What kind of iPad do you have? How are you rendering the image? I tried similar code here: https://gist.github.com/bgayman/f98c3738ef85317def30f06b77696479 and it performed reasonably on an iPhone 7 Plus. If you only want eye position you should consider using Core Image. As was noted on page 72 of the slides in _Vision Framework: Building on Core ML_, Core Image will perform better (in terms of speed and responsiveness) and give you some facial landmarks, including the eyes and whether they are open or closed. – beyowulf Nov 03 '17 at 21:48
  • @beyowulf I have an iPad mini 2, is it too slow for this? I will try Core Image, thanks. In the presentation they said that Vision had better accuracy, so that's what I thought would work better. – iosdude Nov 04 '17 at 06:21
  • Yeah, iPad minis tend to be somewhat underpowered in this regard. Vision will definitely give you more landmarks and is able to recognize faces more accurately, but that comes at the expense of greater latency. – beyowulf Nov 04 '17 at 16:30
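
Update: the Core Image route suggested in the comments would look roughly like this (my own sketch, not beyowulf's code; the detector should really be created once and reused rather than per call):

- (void)detectEyesWithCoreImageInPixelBuffer:(CVPixelBufferRef)pixelBuffer {
    // In real code the detector would be created once and kept around;
    // CIDetectorTracking enables frame-to-frame tracking for video.
    CIDetector *faceDetector =
        [CIDetector detectorOfType:CIDetectorTypeFace
                           context:nil
                           options:@{CIDetectorAccuracy: CIDetectorAccuracyLow,
                                     CIDetectorTracking: @YES}];

    CIImage *image = [CIImage imageWithCVPixelBuffer:pixelBuffer];
    NSArray *features =
        [faceDetector featuresInImage:image
                              options:@{CIDetectorEyeBlink: @YES,
                                        CIDetectorImageOrientation: @(6)}]; // EXIF 6 ("right") for portrait

    for (CIFaceFeature *face in features) {
        if (face.hasLeftEyePosition && face.hasRightEyePosition) {
            CGPoint leftEye = face.leftEyePosition;   // image coordinates
            CGPoint rightEye = face.rightEyePosition;
            NSLog(@"left eye %@ (closed: %d), right eye %@ (closed: %d)",
                  NSStringFromCGPoint(leftEye), face.leftEyeClosed,
                  NSStringFromCGPoint(rightEye), face.rightEyeClosed);
            // ...convert to layer coordinates and draw...
        }
    }
}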
