
I've read Apple's official RealtimeNumberReader sample. It uses AVCaptureSession together with layerRectConverted(fromMetadataOutputRect:), a method that exists only on AVCaptureVideoPreviewLayer, to convert a bounding box into screen coordinates:

let rect = layer.layerRectConverted(fromMetadataOutputRect: box.applying(self.visionToAVFTransform))

Now I want to recognize text in an ARFrame's capturedImage and then display the bounding box on screen. Is that possible?

I know how to recognize text in a single image from the official tutorial; my problem is how to convert the normalized bounding-box coordinates to viewport coordinates.

Please help and thank you very much!!!

2 Answers


Based on @Banane42's answer, I worked out the theory behind ARKit and VNRecognizeTextRequest:

  1. An ARKit sceneview's capturedImage is wider than what you can see on screen. To verify this, I made a small app with an imageView that displays the whole capturedImage, with the sceneview area shown in front of it: the captured image extends beyond the visible region.
     (image demonstrating that ARKit's capturedImage is larger than the sceneview)
  2. The coordinate system of the sceneview (and of the image) originates at the top-left corner, with the x-axis pointing right and the y-axis pointing down. But the boundingBox that a VNRequest returns originates at the bottom-left corner, with the x-axis pointing right and the y-axis pointing up.
  3. If you use request.regionOfInterest, the ROI must be given in normalized coordinates with respect to the whole image, and the boundingBox the VNRequest returns is in normalized coordinates with respect to that ROI.
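Points 2 and 3 can be expressed as a small helper. This is a hedged sketch of my own (the function name and parameters are mine, not from the answer): it un-normalizes a Vision boundingBox from ROI space into whole-image normalized space, flips the y-axis to the top-left-origin convention, and finally scales up to pixel coordinates.

```swift
import Foundation

// Hypothetical helper illustrating points 2 and 3. `box` is a Vision
// boundingBox: normalized, bottom-left origin, relative to the ROI.
// `roi` is the regionOfInterest, normalized relative to the whole image.
// The result is a rect in image pixel coordinates with a top-left origin.
func imageRect(forVisionBox box: CGRect,
               regionOfInterest roi: CGRect,
               imageSize: CGSize) -> CGRect {
    // 1. Un-normalize from ROI space to whole-image normalized space.
    var r = CGRect(x: roi.origin.x + box.origin.x * roi.width,
                   y: roi.origin.y + box.origin.y * roi.height,
                   width: box.width * roi.width,
                   height: box.height * roi.height)
    // 2. Flip the y-axis: Vision's origin is bottom-left, images use top-left.
    r.origin.y = 1 - r.origin.y - r.height
    // 3. Scale from normalized coordinates to pixels.
    return CGRect(x: r.origin.x * imageSize.width,
                  y: r.origin.y * imageSize.height,
                  width: r.width * imageSize.width,
                  height: r.height * imageSize.height)
}
```

For example, a box hugging the bottom of the image (Vision y = 0) comes out near the maximum y in image coordinates, which is what the flip in step 2 is for.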

Finally, I got my app working properly. The conversion is quite involved, so be careful!
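On a real device, point 1 is normally handled by ARFrame.displayTransform(for:viewportSize:), which maps normalized image coordinates into normalized viewport coordinates. To make the geometry behind it explicit, here is a standalone sketch of my own (not the author's code) of the aspect-fill math: the capturedImage is scaled to completely cover the viewport, so the overhanging band of the wider dimension ends up off-screen. It assumes the image and viewport share the same orientation and works directly in pixels/points.

```swift
import Foundation

// Hypothetical illustration of point 1. `rect` is in image pixel
// coordinates (top-left origin); the result is in viewport points.
// Assumes image and viewport have the same orientation; on-device you
// would use ARFrame.displayTransform(for:viewportSize:) instead.
func viewportRect(forImageRect rect: CGRect,
                  imageSize: CGSize,
                  viewportSize: CGSize) -> CGRect {
    // Aspect-fill: scale so the image covers the entire viewport.
    let scale = max(viewportSize.width / imageSize.width,
                    viewportSize.height / imageSize.height)
    // The scaled image overhangs the viewport; the crop is centered,
    // which is why part of the capturedImage is never visible.
    let xOffset = (imageSize.width * scale - viewportSize.width) / 2
    let yOffset = (imageSize.height * scale - viewportSize.height) / 2
    return CGRect(x: rect.origin.x * scale - xOffset,
                  y: rect.origin.y * scale - yOffset,
                  width: rect.width * scale,
                  height: rect.height * scale)
}
```

Note that a rect near the cropped edge of the image can legitimately come back with a negative origin, meaning it lies outside the visible viewport.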


Try looking at this git repo. Having messed with it myself, it is not the most performant, but it should give you a start.

Banane42
    Thank you very much! Based on your example, I've finally figured out the solution. I'll post another answer to explain it for other people who may need it. – Chengxing Zhang Mar 30 '21 at 14:41