I am attempting to learn object detection on iOS so that I can mark the location of a detected object. I have the model trained and added to the project, and my next step was to show an ARView on screen, which works. However, when I turn my Vision processing code on via a button, the image on screen ends up rotated and distorted (most likely just stretched because of an inverted axis).
I found a partial tutorial that I was using as a guide; the author seems to have run into this same issue and solved it, but did not show the solution, and I have no way of contacting them. Their comment was: "one slightly tricky aspect to this was that the coordinate system returned from Vision was different than SwiftUI’s coordinate system (normalized and the y-axis was flipped), but some simple transformations did the trick."
I have no idea which "simple transformations" those were, but I suspect they are simd-related. If anyone has insight into this, I would appreciate help solving the rotation and distortion issue.
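For reference, here is my current (untested) guess at what that transformation might be. The helper name and the `viewSize` parameter are mine, not the tutorial's, and this ignores both the `.right` orientation I pass to Vision in the code below and any aspect-fill cropping the ARView does, either of which may be exactly what I'm getting wrong:

import Vision
import CoreGraphics

// My guess at converting a Vision bounding box (normalized, origin at the
// bottom-left) into view coordinates (points, origin at the top-left).
func convertToViewRect(_ normalizedBox: CGRect, viewSize: CGSize) -> CGRect {
    // Scale the normalized rect up to the view's size.
    let scaled = VNImageRectForNormalizedRect(normalizedBox,
                                              Int(viewSize.width),
                                              Int(viewSize.height))
    // Flip the y-axis: Vision's origin is bottom-left, SwiftUI/UIKit's is top-left.
    return CGRect(x: scaled.minX,
                  y: viewSize.height - scaled.maxY,
                  width: scaled.width,
                  height: scaled.height)
}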
I also get errors in the console as soon as Vision starts, similar to these:
2022-05-12 21:14:39.142550-0400 Find My Apple Remote[66143:9990936] [Assets] Resolving material name 'engine:BuiltinRenderGraphResources/AR/arInPlacePostProcessCombinedPermute7.rematerial' as an asset path -- this usage is deprecated; instead provide a valid bundle
2022-05-12 21:14:39.270684-0400 Find My Apple Remote[66143:9991089] [Session] ARSession <0x111743970>: ARSessionDelegate is retaining 11 ARFrames. This can lead to future camera frames being dropped.
2022-05-12 21:14:40.121810-0400 Find My Apple Remote[66143:9991117] [CAMetalLayer nextDrawable] returning nil because allocation failed.
The one that concerns me the most is the last one.
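On the ARFrames warning: my guess is that it comes from running the Vision request synchronously inside the session delegate callback. I was going to try moving that work onto my own queue, roughly like the sketch below (the class name, the queue label, and the `makeRequest` hook are mine, just to illustrate the idea; it doesn't touch the rotation problem):

import ARKit
import Vision

final class DetectionCoordinator: NSObject, ARSessionDelegate {
    // A serial queue for Vision work so the ARSession delegate returns quickly.
    private let visionQueue = DispatchQueue(label: "visionQueue")
    // Simple flag to drop frames while a request is still running
    // (not worrying about making the flag strictly thread-safe in this sketch).
    private var isProcessing = false
    // Hypothetical hook that builds my VNCoreMLRequest.
    var makeRequest: (() -> VNRequest)?

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        guard !isProcessing, let buildRequest = makeRequest else { return }
        isProcessing = true
        // Keep only the pixel buffer, not the whole ARFrame.
        let pixelBuffer = frame.capturedImage
        visionQueue.async { [weak self] in
            defer { self?.isProcessing = false }
            let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                orientation: .right,
                                                options: [:])
            try? handler.perform([buildRequest()])
        }
    }
}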
My code, so far, is:
struct ContentView: View {
    @State private var isDetecting = false
    @State private var success = false

    var body: some View {
        VStack {
            RealityKitView(isDetecting: $isDetecting, success: $success)
                .overlay(alignment: .top) {
                    Image(systemName: success ? "checkmark.circle" : "slash.circle")
                        .foregroundColor(success ? .green : .red)
                }
            Button {
                isDetecting.toggle()
            } label: {
                Text(isDetecting ? "Stop Detecting" : "Start Detecting")
                    .frame(width: 150, height: 50)
                    .background(
                        Capsule()
                            .fill(isDetecting ? Color.red.opacity(0.5) : Color.green.opacity(0.5))
                    )
            }
        }
    }
}
import SwiftUI
import ARKit
import RealityKit
import Vision
struct RealityKitView: UIViewRepresentable {
    let arView = ARView()
    let scale = SIMD3<Float>(repeating: 0.1)
    let model: VNCoreMLModel? = RealityKitView.returnMLModel()

    @Binding var isDetecting: Bool
    @Binding var success: Bool
    @State var boundingBox: CGRect?

    func makeUIView(context: Context) -> some UIView {
        // Start the AR session
        let session = configureSession()
        // Handle ARSession events via the delegate
        session.delegate = context.coordinator
        return arView
    }

    func configureSession() -> ARSession {
        let session = arView.session
        let config = ARWorldTrackingConfiguration()
        config.planeDetection = [.horizontal, .vertical]
        config.environmentTexturing = .automatic
        session.run(config)
        return session
    }

    static func returnMLModel() -> VNCoreMLModel? {
        do {
            let detector = try AppleRemoteDetector()
            let model = try VNCoreMLModel(for: detector.model)
            return model
        } catch {
            print("RealityKitView:returnMLModel failed with error: \(error)")
        }
        return nil
    }

    func updateUIView(_ uiView: UIViewType, context: Context) {}

    func makeCoordinator() -> Coordinator {
        Coordinator(self)
    }

    class Coordinator: NSObject, ARSessionDelegate {
        var parent: RealityKitView

        init(_ parent: RealityKitView) {
            self.parent = parent
        }

        func session(_ session: ARSession, didUpdate frame: ARFrame) {
            // Start Vision processing
            if parent.isDetecting {
                guard let model = parent.model else {
                    return
                }
                // I suspect the problem is here, where the image is captured into a buffer
                // and then turned into an input for the Core ML model.
                let pixelBuffer = frame.capturedImage
                let input = AppleRemoteDetectorInput(image: pixelBuffer)
                do {
                    let request = VNCoreMLRequest(model: model) { (request, error) in
                        guard
                            let results = request.results,
                            !results.isEmpty,
                            let recognizedObjectObservation = results as? [VNRecognizedObjectObservation],
                            let first = recognizedObjectObservation.first
                        else {
                            self.parent.boundingBox = nil
                            self.parent.success = false
                            return
                        }
                        self.parent.success = true
                        print("\(first.boundingBox)")
                        self.parent.boundingBox = first.boundingBox
                    }
                    model.featureProvider = input
                    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                        orientation: CGImagePropertyOrientation.right,
                                                        options: [:])
                    try handler.perform([request])
                } catch {
                    print(error)
                }
            }
        }
    }
}
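In case it clarifies what I'm ultimately trying to do (mark the place of the detected object): once the bounding box is converted into view coordinates, I intend to draw it with something like this hypothetical overlay view. My `boundingBox` currently lives inside the representable and doesn't reach ContentView yet, so the `detectedRect` property here is made up:

import SwiftUI

// Hypothetical overlay I plan to attach to RealityKitView once I have a
// bounding box that has already been converted into view coordinates.
struct DetectionOverlay: View {
    let detectedRect: CGRect?   // made-up property: a rect in this view's coordinate space

    var body: some View {
        if let rect = detectedRect {
            Rectangle()
                .stroke(Color.yellow, lineWidth: 2)
                .frame(width: rect.width, height: rect.height)
                .position(x: rect.midX, y: rect.midY)
        }
    }
}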