6

I'm trying to modify the on-device text recognition example provided by Google here to make it work with a live camera feed.

When I hold the camera over text (text that works with the still-image example), the console streams the following message until the app eventually runs out of memory:

2018-05-16 10:48:22.129901+1200 TextRecognition[32138:5593533] An empty result returned from from GMVDetector for VisionTextDetector.

This is my video capture method:

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    if let textDetector = self.textDetector {
        let visionImage = VisionImage(buffer: sampleBuffer)
        let metadata = VisionImageMetadata()
        metadata.orientation = .rightTop
        visionImage.metadata = metadata

        textDetector.detect(in: visionImage) { (features, error) in
            guard error == nil, let features = features, !features.isEmpty else {
                // Error. You should also check the console for error messages.
                // ...
                return
            }

            // Recognized and extracted text
            print("Detected text has: \(features.count) blocks")
            // ...
        }
    }
}

Is this the right way to do it?

dave
  • There must be a gotcha here somewhere; I am having the same issue as this question https://stackoverflow.com/questions/50246800/firebase-mlkit-text-recognition-error and your question looks related. It would be great if one of the Firebase people read this :) – Jason May 16 '18 at 00:09
  • @dave, at this moment the SDK can only accept upright images. Is your image rotated? It's stated in the developer document (search for "Create a VisionImage object using a UIImage or a CMSampleBufferRef." in https://firebase-dot-devsite.googleplex.com/docs/ml-kit/ios/recognize-text#1-run-the-text-detector) – Isabella Chen May 16 '18 at 00:23
  • Hi @IsabellaChen, the camera is in portrait mode, but the empty-result message appears regardless of the orientation – dave May 16 '18 at 01:15
  • @IsabellaChen Is there a working example available that uses a live video feed for text detection? I'm finding that I can detect barcodes from a live video feed with the barcode detector, but if I use the same approach for text recognition I get the above error – dave May 16 '18 at 02:42
  • @dave, I will double-check tomorrow that we have the right sample in Firebase Quick Start. But unfortunately, as I mentioned, the SDK cannot handle rotated images at this moment (the rotation hint you pass into VisionImageMetadata will not be respected). You have to rotate the CMSampleBuffer yourself to make the text upright in the image (the maximum rotation angle is 45 degrees for text detection to work; the barcode detector probably works with rotated images). To test this out quickly, you can rotate your device to landscape (right) mode to see whether it works. – Isabella Chen May 16 '18 at 05:20
  • Thank you @IsabellaChen - I’ve tried rotating the phone in every orientation but I’m not seeing any results. Perhaps I should try converting the pixel buffer into a UIImage? – dave May 16 '18 at 07:40
  • @IsabellaChen I think this is a bug as well, as I am having the same issue. No matter what orientation the device is in, and whether or not I rotate the orientation of the UIImage programmatically, I always get nil for the results. – BlackMirrorz May 16 '18 at 09:33
  • @Josh Robbins (& dave), thanks for reporting. Let us look into this and then get back to you two. – Isabella Chen May 16 '18 at 16:40
  • @Josh Robbins (& dave), I posted some Objective-C code snippets below using CMSampleBuffer, and it should work. Could you try it out? If it still doesn't work for you two, could you share 1) your device type, 2) whether you set any value for kCVPixelBufferPixelFormatTypeKey, and 3) the first format of availableVideoCVPixelFormatTypes for your device? Thanks. – Isabella Chen May 17 '18 at 22:48

2 Answers

7

ML Kit has long since migrated out of Firebase and become a standalone SDK (see the migration guide).

The Quick Start sample app in Swift showing how to do text recognition from a live video stream using ML Kit (with CMSampleBuffer) is now available here:

https://github.com/googlesamples/mlkit/tree/master/ios/quickstarts/textrecognition/TextRecognitionExample

The live feed is implemented in CameraViewController.swift:

https://github.com/googlesamples/mlkit/blob/master/ios/quickstarts/textrecognition/TextRecognitionExample/CameraViewController.swift
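
Roughly, the live-feed callback with the standalone SDK looks like the sketch below. This assumes the GoogleMLKit/TextRecognition pod and the default Latin-script TextRecognizerOptions, and LiveTextScanner is only a placeholder class name; API details have shifted between releases, so treat the linked sample as authoritative.

import AVFoundation
import UIKit
import MLKitVision
import MLKitTextRecognition

// Placeholder class; in the sample this logic lives in CameraViewController.
final class LiveTextScanner: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {

  // Latin-script recognizer; create it once and reuse it across frames.
  private let textRecognizer = TextRecognizer.textRecognizer(options: TextRecognizerOptions())

  func captureOutput(_ output: AVCaptureOutput,
                     didOutput sampleBuffer: CMSampleBuffer,
                     from connection: AVCaptureConnection) {
    let image = VisionImage(buffer: sampleBuffer)
    // With the standalone SDK, orientation is set directly on the image and
    // must match how the buffer is actually rotated.
    image.orientation = .right  // back camera, device held in portrait

    textRecognizer.process(image) { result, error in
      guard error == nil, let result = result, !result.blocks.isEmpty else { return }
      print("Detected text has: \(result.blocks.count) blocks")
    }
  }
}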

Dong Chen
  • I've updated this three-plus-year-old answer to reflect the latest state of ML Kit. – Dong Chen Sep 06 '21 at 00:47
  • I think this example has been moved to a file combined with facial recognition, barcode recognition, etc.: https://github.com/googlesamples/mlkit/blob/master/ios/quickstarts/vision/VisionExample/CameraViewController.swift – btraas Mar 25 '23 at 02:28
2

ML Kit is still in the process of adding sample code for CMSampleBuffer usage to the Firebase Quick Start.

In the meantime, the code below works with a CMSampleBuffer.

Set up AV Capture (use kCVPixelFormatType_32BGRA for kCVPixelBufferPixelFormatTypeKey):

@property(nonatomic, strong) AVCaptureSession *session;
@property(nonatomic, strong) AVCaptureVideoDataOutput *videoDataOutput;
// Serial queue on which sample buffers are delivered to the delegate.
@property(nonatomic, strong) dispatch_queue_t videoDataOutputQueue;

- (void)setupVideoProcessing {
  self.videoDataOutput = [[AVCaptureVideoDataOutput alloc] init];
  NSDictionary *rgbOutputSettings = @{
      (__bridge NSString*)kCVPixelBufferPixelFormatTypeKey :  @(kCVPixelFormatType_32BGRA)
  };
  [self.videoDataOutput setVideoSettings:rgbOutputSettings];

  if (![self.session canAddOutput:self.videoDataOutput]) {
    [self cleanupVideoProcessing];
    NSLog(@"Failed to setup video output");
    return;
  }
  [self.videoDataOutput setAlwaysDiscardsLateVideoFrames:YES];
  [self.videoDataOutput setSampleBufferDelegate:self queue:self.videoDataOutputQueue];
  [self.session addOutput:self.videoDataOutput];
}
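
For anyone wiring this up in Swift, as in the question, a rough equivalent of the setup above could look like this; CaptureSetupExample is only a placeholder class, and only the video-output wiring is shown:

import AVFoundation

final class CaptureSetupExample: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {

  let session = AVCaptureSession()
  private let videoDataOutputQueue = DispatchQueue(label: "VideoDataOutputQueue")
  private var videoDataOutput: AVCaptureVideoDataOutput?

  func setupVideoProcessing() {
    let output = AVCaptureVideoDataOutput()
    // Ask for BGRA frames, matching kCVPixelFormatType_32BGRA above.
    output.videoSettings = [
      kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
    ]
    output.alwaysDiscardsLateVideoFrames = true
    output.setSampleBufferDelegate(self, queue: videoDataOutputQueue)

    guard session.canAddOutput(output) else {
      print("Failed to set up video output")
      return
    }
    session.addOutput(output)
    videoDataOutput = output
  }

  func captureOutput(_ output: AVCaptureOutput,
                     didOutput sampleBuffer: CMSampleBuffer,
                     from connection: AVCaptureConnection) {
    // Frames arrive here; hand the sample buffer to the text detector.
  }
}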

Consume the CMSampleBuffer and run detection:

- (void)runDetection:(AVCaptureOutput *)captureOutput
    didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
           fromConnection:(AVCaptureConnection *)connection {

  CVImageBufferRef imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
  size_t imageWidth = CVPixelBufferGetWidth(imageBuffer);
  size_t imageHeight = CVPixelBufferGetHeight(imageBuffer);

  AVCaptureDevicePosition devicePosition = self.isUsingFrontCamera ? AVCaptureDevicePositionFront : AVCaptureDevicePositionBack;

  // Calculate the image orientation.
  UIDeviceOrientation deviceOrientation = [[UIDevice currentDevice] orientation];
  FIRVisionDetectorImageOrientation orientation =
      [ImageUtility imageOrientationFromOrientation:deviceOrientation
                          withCaptureDevicePosition:devicePosition
                           defaultDeviceOrientation:[self deviceOrientationFromInterfaceOrientation]];
  // Invoke text detection.
  FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:sampleBuffer];
  FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
  metadata.orientation = orientation;
  image.metadata = metadata;

  FIRVisionTextDetectionCallback callback =
      ^(NSArray<id<FIRVisionText>> *_Nullable features, NSError *_Nullable error) {
        // Handle the recognized text blocks (features) or the error here.
        // ...
      };

  [self.textDetector detectInImage:image completion:callback];
}

The ImageUtility helper function used above to determine the orientation:

+ (FIRVisionDetectorImageOrientation)imageOrientationFromOrientation:(UIDeviceOrientation)deviceOrientation
                             withCaptureDevicePosition:(AVCaptureDevicePosition)position
                              defaultDeviceOrientation:(UIDeviceOrientation)defaultOrientation {
  if (deviceOrientation == UIDeviceOrientationFaceDown ||
      deviceOrientation == UIDeviceOrientationFaceUp ||
      deviceOrientation == UIDeviceOrientationUnknown) {
    deviceOrientation = defaultOrientation;
  }
  FIRVisionDetectorImageOrientation orientation = FIRVisionDetectorImageOrientationTopLeft;
  switch (deviceOrientation) {
    case UIDeviceOrientationPortrait:
      if (position == AVCaptureDevicePositionFront) {
        orientation = FIRVisionDetectorImageOrientationLeftTop;
      } else {
        orientation = FIRVisionDetectorImageOrientationRightTop;
      }
      break;
    case UIDeviceOrientationLandscapeLeft:
      if (position == AVCaptureDevicePositionFront) {
        orientation = FIRVisionDetectorImageOrientationBottomLeft;
      } else {
        orientation = FIRVisionDetectorImageOrientationTopLeft;
      }
      break;
    case UIDeviceOrientationPortraitUpsideDown:
      if (position == AVCaptureDevicePositionFront) {
        orientation = FIRVisionDetectorImageOrientationRightBottom;
      } else {
        orientation = FIRVisionDetectorImageOrientationLeftBottom;
      }
      break;
    case UIDeviceOrientationLandscapeRight:
      if (position == AVCaptureDevicePositionFront) {
        orientation = FIRVisionDetectorImageOrientationTopRight;
      } else {
        orientation = FIRVisionDetectorImageOrientationBottomRight;
      }
      break;
    default:
      orientation = FIRVisionDetectorImageOrientationTopLeft;
      break;
  }

  return orientation;
}
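
For completeness, here is a rough Swift translation of the same mapping against the Firebase Swift surface of the time, where FIRVisionDetectorImageOrientation imports as VisionDetectorImageOrientation; the import and the helper name are just for this sketch:

import AVFoundation
import UIKit
import FirebaseMLVision  // or import Firebase, depending on your pod setup

// Swift sketch of the device-orientation -> image-orientation mapping above.
func imageOrientation(from deviceOrientation: UIDeviceOrientation,
                      cameraPosition: AVCaptureDevice.Position) -> VisionDetectorImageOrientation {
  switch deviceOrientation {
  case .portrait:
    return cameraPosition == .front ? .leftTop : .rightTop
  case .landscapeLeft:
    return cameraPosition == .front ? .bottomLeft : .topLeft
  case .portraitUpsideDown:
    return cameraPosition == .front ? .rightBottom : .leftBottom
  case .landscapeRight:
    return cameraPosition == .front ? .topRight : .bottomRight
  default:
    // Face up, face down, and unknown fall back to .topLeft here; the
    // Objective-C version above substitutes a default device orientation first.
    return .topLeft
  }
}

In the question's captureOutput, the hard-coded .rightTop would then become metadata.orientation = imageOrientation(from: UIDevice.current.orientation, cameraPosition: .back).
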
Isabella Chen
  • I really can't explain why... I had only just installed the pods a few days ago. I did an update today and now all my code is working fine... no idea why! – BlackMirrorz May 18 '18 at 05:55
  • @JoshRobbins Great to hear that. There should be no change since the I/O announcement, but this is a good surprise :) Thanks for sharing! – Isabella Chen May 18 '18 at 18:58