
I am trying to use Vision and Core ML to perform style transfer on tracked objects in as close to real time as possible. I am using AVFoundation to capture video, and AVCaptureVideoDataOutputSampleBufferDelegate to get each frame.

At a high level, my pipeline is:

1) detect faces

2) update the preview layers to draw bounding boxes at the proper screen location

3) crop the original image to the detected faces

4) run the face images through a Core ML model, and get new images as output

5) fill the preview layers (wherever they are) with the new images

I was hoping to place the bounding boxes as soon as they were computed (on the main thread), and then fill them once inference was done. However, I've found that after adding the Core ML inference to the pipeline (on the AVCaptureOutputQueue or a CoreMLQueue), the bounding boxes do not update their positions until the inference is complete. Maybe I am missing something with how queues are handled in closures. The (hopefully) relevant parts of the code are below.
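
For context, the queues and the video data output delegate are set up roughly as in the sketch below (simplified from Apple's sample; configureVideoDataOutput is an illustrative name and the exact configuration may differ from my real code):

import AVFoundation

// Simplified sketch of the queue/delegate setup (illustrative, not my exact code).
let AVCaptureOutputQueue = DispatchQueue(label: "AVCaptureOutputQueue") // serial: frames are delivered one at a time
let CoreMLQueue = DispatchQueue(label: "CoreMLQueue")                   // serial: Core ML work is queued here

func configureVideoDataOutput(for session: AVCaptureSession,
                              delegate: AVCaptureVideoDataOutputSampleBufferDelegate) {
    let videoDataOutput = AVCaptureVideoDataOutput()
    videoDataOutput.alwaysDiscardsLateVideoFrames = true
    // captureOutput(_:didOutput:from:) is invoked on AVCaptureOutputQueue, so any
    // synchronous work done there delays delivery of subsequent frames.
    videoDataOutput.setSampleBufferDelegate(delegate, queue: AVCaptureOutputQueue)
    if session.canAddOutput(videoDataOutput) {
        session.addOutput(videoDataOutput)
    }
}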

I'm modifying the code from https://developer.apple.com/documentation/vision/tracking_the_user_s_face_in_real_time.

public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
    from connection: AVCaptureConnection) {
    // omitting stuff that gets pixelBuffers etc formatted for use with Vision
    // and sets up tracking requests

    // Perform landmark detection on tracked faces
    for trackingRequest in newTrackingRequests {

        let faceLandmarksRequest = VNDetectFaceLandmarksRequest(completionHandler: { (request, error) in

            guard let landmarksRequest = request as? VNDetectFaceLandmarksRequest,
                let results = landmarksRequest.results as? [VNFaceObservation] else {
                    return
            }

            // Perform all UI updates (drawing) on the main queue,
            // not the background queue on which this handler is being called.

            DispatchQueue.main.async {
                self.drawFaceObservations(results) //<<- places bounding box on the preview layer
            }

            CoreMLQueue.async { // queue for Core ML work

                // Get the region of the frame to crop for Core ML,
                // bailing out if no face was observed
                guard let boundingBox = results.first?.boundingBox else { return }

                // Crop the input frame to the detected face
                let image: CVPixelBuffer = self.cropFrame(pixelBuffer: pixelBuffer, region: boundingBox)

                // Run the cropped region through the Core ML model
                let styleImage: CGImage = self.performCoreMLInference(on: image)

                // On the main thread, place styleImage into the bounding box (a CAShapeLayer)
                DispatchQueue.main.async {
                    self.boundingBoxOverlayLayer?.contents = styleImage
                }
            }
        })

        do {
            try requestHandler.perform([faceLandmarksRequest])
        } catch let error as NSError {
            NSLog("Failed Request: %@", error)
        }
    }
}
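
The cropFrame and performCoreMLInference helpers are omitted above. For reference, the crop step looks roughly like the hypothetical Core Image sketch below (illustrative only; it assumes the model wants a 32BGRA buffer, and my actual implementation may differ):

import CoreImage
import CoreVideo
import Vision

let ciContext = CIContext()

// Hypothetical sketch of cropFrame: convert the normalized Vision bounding box to
// pixel coordinates and render that region into a new pixel buffer.
func cropFrame(pixelBuffer: CVPixelBuffer, region: CGRect) -> CVPixelBuffer {
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    // Vision bounding boxes are normalized, with the origin in the lower-left corner.
    let pixelRect = VNImageRectForNormalizedRect(region, width, height)

    // Crop with Core Image and shift the result back to the origin.
    let cropped = CIImage(cvPixelBuffer: pixelBuffer)
        .cropped(to: pixelRect)
        .transformed(by: CGAffineTransform(translationX: -pixelRect.origin.x,
                                           y: -pixelRect.origin.y))

    var output: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault,
                                     Int(pixelRect.width),
                                     Int(pixelRect.height),
                                     kCVPixelFormatType_32BGRA, // assuming the model expects BGRA
                                     nil,
                                     &output)
    guard status == kCVReturnSuccess, let outputBuffer = output else {
        return pixelBuffer // fall back to the full frame if allocation fails
    }
    ciContext.render(cropped, to: outputBuffer)
    return outputBuffer
}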

Beyond a queue/synchronization issue, I was thinking one cause for the slowdown could be cropping the pixel buffer to the region of interest. I'm out of ideas here; any help would be appreciated.

tyrotyrotyro

1 Answer


I am using a pipeline of https://github.com/maxvol/RxAVFoundation and https://github.com/maxvol/RxVision to address the synchronization issues.

A basic example -

// These are declared as properties (shown inline here for brevity; they are referenced via self below).
let textRequest: RxVNDetectTextRectanglesRequest<CVPixelBuffer> = VNDetectTextRectanglesRequest.rx.request(reportCharacterBoxes: true)
var session = AVCaptureSession.rx.session()
var requests = [RxVNRequest<CVPixelBuffer>]()

self.requests = [self.textRequest]
self
  .textRequest
  .observable
  .observeOn(Scheduler.main)
  .subscribe { [unowned self] (event) in
      switch event {
      case .next(let completion):
          self.detectTextHandler(value: completion.value, request: completion.request, error: completion.error)
      default:
          break
      }
  }
  .disposed(by: disposeBag)

self.session
  .flatMapLatest { [unowned self] (session) -> Observable<CaptureOutput> in
      let imageLayer = session.previewLayer
      imageLayer.frame = self.imageView.bounds
      self.imageView.layer.addSublayer(imageLayer)
      return session.captureOutput
  }
  .subscribe { [unowned self] (event) in
      switch event {
      case .next(let captureOutput):
          guard let pixelBuffer = CMSampleBufferGetImageBuffer(captureOutput.sampleBuffer) else {
              return
          }
          var requestOptions: [VNImageOption: Any] = [:]
          if let camData = CMGetAttachment(captureOutput.sampleBuffer, key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, attachmentModeOut: nil) {
              requestOptions = [.cameraIntrinsics: camData]
          }
          let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: requestOptions)
          do {
              try imageRequestHandler.rx.perform(self.requests, with: pixelBuffer)
          } catch {
              os_log("error: %@", "\(error)")
          }
          break
      case .error(let error):
          os_log("error: %@", "\(error)")
          break
      case .completed:
          // never happens
          break
      }
  }
  .disposed(by: disposeBag)

Maxim Volgin