
I am using the Vision framework to detect text boxes in an ARKit session. My problem is accessing the frame to perform a hit test once I have detected the boxes.

func startTextDetection() {
    let textRequest = VNDetectTextRectanglesRequest(completionHandler: self.detectTextHandler)
    textRequest.reportCharacterBoxes = true
    self.requests = [textRequest]
}

func detectTextHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results else {
        print("no result")
        return
    }

    let result = observations.compactMap { $0 as? VNTextObservation }
    for box in result {
        // `frame` is the ARFrame I need here, but I have no way to access it
        // from this handler. Hit-test the center of the detected box.
        let center = CGPoint(x: (box.topRight.x + box.bottomLeft.x) / 2,
                             y: (box.topRight.y + box.bottomLeft.y) / 2)
        guard let hit = frame.hitTest(center, types: .featurePoint).first else { continue }
        let anchor = ARAnchor(transform: hit.worldTransform)
        sceneView.session.add(anchor: anchor)
    }
}

Ideally I would pass it to the completion handler from the ARSession delegate method, but although the documentation says I can pass a completion handler there, I have not found a way to do it.

func session(_ session: ARSession, didUpdate frame: ARFrame) {
    // Retain the image buffer for Vision processing.
    let pixelBuffer = frame.capturedImage
    let requestOptions: [VNImageOption: Any] = [:]

    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: CGImagePropertyOrientation.up, options: requestOptions)

    do {
        try imageRequestHandler.perform(self.requests)
    } catch {
        print(error)
    }
}

I could keep a dictionary of frames and look the right one up in the handler, but that is not elegant and is prone to bugs and leaks. I would rather pass along the relevant frame at the point where I request the text detection.
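For illustration, this is roughly what that dictionary workaround would look like (the pendingFrames property and the keying by request are hypothetical names of mine, not anything from my actual code):

// Hypothetical workaround: remember which ARFrame each request was made for.
var pendingFrames = [VNRequest: ARFrame]()

func session(_ session: ARSession, didUpdate frame: ARFrame) {
    // Create a fresh request per frame so it can be used as a lookup key.
    let textRequest = VNDetectTextRectanglesRequest(completionHandler: self.detectTextHandler)
    textRequest.reportCharacterBoxes = true
    pendingFrames[textRequest] = frame // must be removed in the handler, or it leaks

    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, orientation: .up)
    do { try handler.perform([textRequest]) } catch { print(error) }
}

func detectTextHandler(request: VNRequest, error: Error?) {
    guard let frame = pendingFrames.removeValue(forKey: request) else { return }
    // ... hit test against `frame` as above
}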

Any ideas?


1 Answer


Why don't you use your session's currentFrame property inside the completion handler? It holds the session's current frame, so you don't need to pass a frame instance into the completion handler at all; it is accessible through your sceneView instance.

So you can change your completion handler like below:

func detectTextHandler(request: VNRequest, error: Error?) {
    guard let currentFrame = sceneView.session.currentFrame else { return }
    ...
    // Perform the hit test using currentFrame.
    let center = CGPoint(x: (box.topRight.x + box.bottomLeft.x) / 2,
                         y: (box.topRight.y + box.bottomLeft.y) / 2)
    let hit = currentFrame.hitTest(center, types: .featurePoint).first
    ...
}

You can use currentFrame to create the image request handler in session(_:didUpdate:) as well:

let pixelBuffer = sceneView.session.currentFrame?.capturedImage

Also, note that calling perform() on a VNImageRequestHandler inside session(_:didUpdate:) is inefficient, since it runs for every single frame. You could use a Timer instead to reduce how often you run the image detection, as sketched below.
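A minimal sketch of that throttling, assuming a detectText() helper and a 0.5-second interval (both the name and the interval are mine, pick whatever fits your app):

var detectionTimer: Timer?

func startDetectionTimer() {
    // Run detection twice per second instead of on every ARFrame.
    detectionTimer = Timer.scheduledTimer(withTimeInterval: 0.5, repeats: true) { [weak self] _ in
        self?.detectText()
    }
}

func detectText() {
    guard let frame = sceneView.session.currentFrame else { return }
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, orientation: .up)
    do { try handler.perform(self.requests) } catch { print(error) }
}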


Edit: Since image detection is asynchronous and might take time to finish, you can store the frame in a separate property when making the request, and use that property inside the completion handler:

var detectionFrame: ARFrame?

// Inside the Timer's block:
detectionFrame = sceneView.session.currentFrame
guard let pixelBuffer = detectionFrame?.capturedImage else { return }
// image detection request code


func detectTextHandler(request: VNRequest, error: Error?) {
    guard let frame = detectionFrame else { return }
    ...
    let center = CGPoint(x: (box.topRight.x + box.bottomLeft.x) / 2,
                         y: (box.topRight.y + box.bottomLeft.y) / 2)
    let hit = frame.hitTest(center, types: .featurePoint).first
    ...
}
  • I thought about it, but depending on processing time / latency the current frame could be different from the frame that was used to detect the text. Good idea on the timer. – Daniele Bernardini May 21 '19 at 11:24
  • @DanieleBernardini it won't be different if you use currentFrame in both places, I guess? – M Reza May 21 '19 at 11:32
  • The detection is async, so if the detection takes half a second and in the meantime I turned the camera, it would. – Daniele Bernardini May 21 '19 at 11:33
  • @DanieleBernardini you're right. I updated my answer with a possible solution. – M Reza May 21 '19 at 11:43
  • I came to the same conclusion; as I am not processing more than one frame at a time (doh!) I can just use detectionFrame as a lock to prevent a new detection from starting, and clear it once the detection has completed. – Daniele Bernardini May 21 '19 at 11:52
  • On a different topic, there is no tag for the CoreML Vision framework... I can't create it yet, but you can ;-) – Daniele Bernardini May 21 '19 at 11:54