
I am trying to perform continuous speech recognition using AVCapture on the iOS 10 beta. I have set up captureOutput(...) to continuously get CMSampleBuffers. I put these buffers directly into the SFSpeechAudioBufferRecognitionRequest, which I set up previously like this:

... do some setup
  SFSpeechRecognizer.requestAuthorization { authStatus in
    if authStatus == SFSpeechRecognizerAuthorizationStatus.authorized {
      self.m_recognizer = SFSpeechRecognizer()
      self.m_recognRequest = SFSpeechAudioBufferRecognitionRequest()
      self.m_recognRequest?.shouldReportPartialResults = false
      self.m_isRecording = true
    } else {
      print("not authorized")
    }
  }
.... do further setup


func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {

  if !m_AV_initialized {
    print("captureOutput(...): not initialized !")
    return
  }
  if !m_isRecording {
    return
  }

  let formatDesc = CMSampleBufferGetFormatDescription(sampleBuffer)
  let mediaType = CMFormatDescriptionGetMediaType(formatDesc!)
  if mediaType == kCMMediaType_Audio {
    // process audio here
    m_recognRequest?.appendAudioSampleBuffer(sampleBuffer)
  }
}

The whole thing works for a few seconds; then captureOutput is not called anymore. If I comment out the appendAudioSampleBuffer(sampleBuffer) line, captureOutput is called for as long as the app runs (as expected). Apparently, feeding the sample buffers into the speech recognition engine somehow blocks further execution. My guess is that the available buffers are consumed after some time and the process stops because it can't get any more buffers?

I should mention that everything recorded during the first two seconds leads to correct recognitions. I just don't know exactly how the SFSpeech API works, since Apple did not put any text into the beta docs. By the way: how is SFSpeechAudioBufferRecognitionRequest.endAudio() supposed to be used?
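My current guess (not confirmed by the beta docs) is that endAudio() should be called once I stop appending buffers, something like this:

func stopRecording() {
  // assumption: stop appending buffers first, then signal that the audio stream has ended
  m_isRecording = false
  m_recognRequest?.endAudio()
}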

Does anybody know anything about this?

Thanks Chris


6 Answers


I converted the SpeakToMe sample Swift code from the Speech Recognition WWDC developer talk to Objective-C, and it worked for me. For Swift, see https://developer.apple.com/videos/play/wwdc2016/509/, or for Objective-C see below.

- (void)viewDidAppear:(BOOL)animated {

    _recognizer = [[SFSpeechRecognizer alloc] initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en-US"]];
    [_recognizer setDelegate:self];
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus authStatus) {
        switch (authStatus) {
            case SFSpeechRecognizerAuthorizationStatusAuthorized:
                // User gave access to speech recognition
                NSLog(@"Authorized");
                break;

            case SFSpeechRecognizerAuthorizationStatusDenied:
                // User denied access to speech recognition
                NSLog(@"SFSpeechRecognizerAuthorizationStatusDenied");
                break;

            case SFSpeechRecognizerAuthorizationStatusRestricted:
                // Speech recognition restricted on this device
                NSLog(@"SFSpeechRecognizerAuthorizationStatusRestricted");
                break;

            case SFSpeechRecognizerAuthorizationStatusNotDetermined:
                // Speech recognition not yet authorized
                break;

            default:
                NSLog(@"Default");
                break;
        }
    }];

    audioEngine = [[AVAudioEngine alloc] init];
    _speechSynthesizer = [[AVSpeechSynthesizer alloc] init];
    [_speechSynthesizer setDelegate:self];
}


- (void)startRecording
{
    [self clearLogs:nil];

    NSError *outError;

    AVAudioSession *audioSession = [AVAudioSession sharedInstance];
    [audioSession setCategory:AVAudioSessionCategoryRecord error:&outError];
    [audioSession setMode:AVAudioSessionModeMeasurement error:&outError];
    [audioSession setActive:true withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation error:&outError];

    request2 = [[SFSpeechAudioBufferRecognitionRequest alloc] init];

    inputNode = [audioEngine inputNode];

    if (request2 == nil) {
        NSLog(@"Unable to create a SFSpeechAudioBufferRecognitionRequest object");
    }

    if (inputNode == nil) {
        NSLog(@"Unable to create an inputNode object");
    }

    request2.shouldReportPartialResults = true;

    _currentTask = [_recognizer recognitionTaskWithRequest:request2 delegate:self];

    [inputNode installTapOnBus:0 bufferSize:4096 format:[inputNode outputFormatForBus:0] block:^(AVAudioPCMBuffer *buffer, AVAudioTime *when) {
        NSLog(@"Block tap!");
        [request2 appendAudioPCMBuffer:buffer];
    }];

    [audioEngine prepare];
    [audioEngine startAndReturnError:&outError];
    NSLog(@"Error %@", outError);
}

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)result {

    NSLog(@"speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition");
    NSString *translatedString = [[[result bestTranscription] formattedString] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    [self log:translatedString];

    if ([result isFinal]) {
        [audioEngine stop];
        [inputNode removeTapOnBus:0];
        _currentTask = nil;
        request2 = nil;
    }
}
– Kim Truong
  • The question is tagged "Swift". Why do you post Objective-C code *translated from Swift* on a Swift question?! – Eric Aya Jul 10 '16 at 11:32
    Because people looking at this question may have the exact same question about Objective-C, and having a whole separate question would be redundant. – Ruben Martinez Jr. Sep 16 '16 at 10:06
  • Nice answer; you forgot the part where you speak the text, though: // Called for all recognitions, including non-final hypothesis - (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didHypothesizeTranscription:(SFTranscription *)transcription { NSString * translatedString = [transcription formattedString]; NSLog(@"%@", translatedString); [self.speechSynthesizer speakUtterance:[AVSpeechUtterance speechUtteranceWithString:translatedString]]; } – Orbitus007 Sep 23 '16 at 19:21
  • It generates the error: AVAudioEngineGraph required condition is false: NULL != tap – Nilesh Parmar May 11 '17 at 09:59

I have had success using SFSpeechRecognizer continuously. The main point is to use AVCaptureSession to capture audio and feed it to the SpeechRecognizer. Sorry, my Swift is poor, so here is just the Objective-C version.

Here is my sample code (some UI code is left out; the important parts are marked):

@interface ViewController ()<AVCaptureAudioDataOutputSampleBufferDelegate,SFSpeechRecognitionTaskDelegate>
@property (nonatomic, strong) AVCaptureSession *capture;
@property (nonatomic, strong) SFSpeechAudioBufferRecognitionRequest *speechRequest;
@end

@implementation ViewController
- (void)startRecognizer
{
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        if (status == SFSpeechRecognizerAuthorizationStatusAuthorized){
            NSLocale *local =[[NSLocale alloc] initWithLocaleIdentifier:@"fr_FR"];
            SFSpeechRecognizer *sf =[[SFSpeechRecognizer alloc] initWithLocale:local];
            self.speechRequest = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
            [sf recognitionTaskWithRequest:self.speechRequest delegate:self];
            // should call startCapture method in main queue or it may crash
            dispatch_async(dispatch_get_main_queue(), ^{
                [self startCapture];
            });
        }
    }];
}

- (void)endRecognizer
{
    // END capture and END voice Reco
    // or Apple will terminate this task after 30000ms.
    [self endCapture];
    [self.speechRequest endAudio];
}

- (void)startCapture
{
    NSError *error;
    self.capture = [[AVCaptureSession alloc] init];
    AVCaptureDevice *audioDev = [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeAudio];
    if (audioDev == nil){
        NSLog(@"Couldn't create audio capture device");
        return ;
    }

    // create mic device
    AVCaptureDeviceInput *audioIn = [AVCaptureDeviceInput deviceInputWithDevice:audioDev error:&error];
    if (error != nil){
        NSLog(@"Couldn't create audio input");
        return ;
    }

    // add mic device in capture object
    if ([self.capture canAddInput:audioIn] == NO){
        NSLog(@"Couldn't add audio input");
        return ;
    }
    [self.capture addInput:audioIn];
    // export audio data
    AVCaptureAudioDataOutput *audioOutput = [[AVCaptureAudioDataOutput alloc] init];
    [audioOutput setSampleBufferDelegate:self queue:dispatch_get_main_queue()];
    if ([self.capture canAddOutput:audioOutput] == NO){
        NSLog(@"Couldn't add audio output");
        return ;
    }
    [self.capture addOutput:audioOutput];
    [audioOutput connectionWithMediaType:AVMediaTypeAudio];
    [self.capture startRunning];
}

-(void)endCapture
{
    if (self.capture != nil && [self.capture isRunning]){
        [self.capture stopRunning];
    }
}

- (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection
{
    [self.speechRequest appendAudioSampleBuffer:sampleBuffer];
}
// some Recognition Delegate
@end
– cube
  • In my case the delegate method is not called. Here is my code: - (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)result { NSLog(@"speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition"); NSString * translatedString = [[[result bestTranscription] formattedString] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; NSLog(@"Says : %@", translatedString); } – Jagdeep Jan 27 '17 at 13:40
  • Did it work beyond the 1-minute limit of the Speech framework? Otherwise you will need to restart the recognizer as soon as that happens to achieve a "continuous" recognizer behavior. – Josh Feb 08 '17 at 08:48
  • requestAuthorization is not called – Ramani Hitesh Apr 07 '18 at 04:25
  • Any idea if continuous speech recognition will get the app rejected by Apple? – MJQZ1347 Apr 19 '19 at 16:08
  • By continuous recognition, were you able to work it past the 1-minute mark? – Aman pradhan Apr 26 '19 at 11:24

Here is the Swift (3.0) implementation of @cube's answer:

import UIKit
import Speech
import AVFoundation


class ViewController: UIViewController  {
  @IBOutlet weak var console: UITextView!

  var capture: AVCaptureSession?
  var speechRequest: SFSpeechAudioBufferRecognitionRequest?
  override func viewDidLoad() {
    super.viewDidLoad()
  }
  override func viewDidAppear(_ animated: Bool) {
    super.viewDidAppear(animated)
    startRecognizer()
  }

  func startRecognizer() {
    SFSpeechRecognizer.requestAuthorization { (status) in
      switch status {
      case .authorized:
        let locale = NSLocale(localeIdentifier: "fr_FR")
        let sf = SFSpeechRecognizer(locale: locale as Locale)
        self.speechRequest = SFSpeechAudioBufferRecognitionRequest()
        sf?.recognitionTask(with: self.speechRequest!, delegate: self)
        DispatchQueue.main.async {
          // as in @cube's answer, start capturing on the main queue
          self.startCapture()
        }
      case .denied:
        fallthrough
      case .notDetermined:
        fallthrough
      case .restricted:
        print("User Autorization Issue.")
      }
    }

  }

  func endRecognizer() {
    endCapture()
    speechRequest?.endAudio()
  }

  func startCapture() {

    capture = AVCaptureSession()

    guard let audioDev = AVCaptureDevice.defaultDevice(withMediaType: AVMediaTypeAudio) else {
      print("Could not get capture device.")
      return
    }

    guard let audioIn = try? AVCaptureDeviceInput(device: audioDev) else {
      print("Could not create input device.")
      return
    }

    guard true == capture?.canAddInput(audioIn) else {
      print("Couls not add input device")
      return
    }

    capture?.addInput(audioIn)

    let audioOut = AVCaptureAudioDataOutput()
    audioOut.setSampleBufferDelegate(self, queue: DispatchQueue.main)

    guard true == capture?.canAddOutput(audioOut) else {
      print("Could not add audio output")
      return
    }

    capture?.addOutput(audioOut)
    audioOut.connection(withMediaType: AVMediaTypeAudio)
    capture?.startRunning()


  }

  func endCapture() {

    if true == capture?.isRunning {
      capture?.stopRunning()
    }
  }
}

extension ViewController: AVCaptureAudioDataOutputSampleBufferDelegate {
  func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
    speechRequest?.appendAudioSampleBuffer(sampleBuffer)
  }

}

extension ViewController: SFSpeechRecognitionTaskDelegate {

  func speechRecognitionTask(_ task: SFSpeechRecognitionTask, didFinishRecognition recognitionResult: SFSpeechRecognitionResult) {
    console.text = console.text + "\n" + recognitionResult.bestTranscription.formattedString
  }
}

Don't forget to add a value for NSSpeechRecognitionUsageDescription in the Info.plist file, or it will crash.
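For example (the description strings below are placeholders; since this sample also captures the microphone via AVCaptureSession, an NSMicrophoneUsageDescription entry is presumably needed as well):

<key>NSSpeechRecognitionUsageDescription</key>
<string>Speech is sent to Apple to be transcribed.</string>
<key>NSMicrophoneUsageDescription</key>
<string>The microphone is used to capture audio for speech recognition.</string>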

– M. Porooshani

It turns out that Apple's new native speech recognition does not detect end-of-speech silences automatically (a bug?), which in your case is useful, because speech recognition stays active for nearly one minute (the maximum period permitted by Apple's service). So basically, if you need continuous ASR, you must re-launch speech recognition when your delegate triggers:

func speechRecognitionTask(task: SFSpeechRecognitionTask, didFinishSuccessfully successfully: Bool) // whether successfully == true or not

Here is the recording/speech recognition Swift code I use; it works perfectly. Ignore the part where I calculate the mean power of the microphone volume if you don't need it (I use it to animate a waveform). Don't forget to set the SFSpeechRecognitionTaskDelegate and its delegate methods. If you need extra code, let me know.

func startNativeRecording() throws {
        LEVEL_LOWPASS_TRIG=0.01
        //Setup Audio Session
        node = audioEngine.inputNode!
        let recordingFormat = node!.outputFormatForBus(0)
        node!.installTapOnBus(0, bufferSize: 1024, format: recordingFormat){(buffer, _) in
            self.nativeASRRequest.appendAudioPCMBuffer(buffer)

 //Code to animate a waveform with the microphone volume, ignore if you don't need it:
            var inNumberFrames:UInt32 = buffer.frameLength;
            var samples:Float32 = buffer.floatChannelData[0][0]; //https://github.com/apple/swift-evolution/blob/master/proposals/0107-unsaferawpointer.md
            var avgValue:Float32 = 0;
            vDSP_maxmgv(buffer.floatChannelData[0], 1, &avgValue, vDSP_Length(inNumberFrames)); //Accelerate Framework
            //vDSP_maxmgv returns peak values
            //vDSP_meamgv returns mean magnitude of a vector

            let avg3:Float32=((avgValue == 0) ? (0-100) : 20.0)
            var averagePower=(self.LEVEL_LOWPASS_TRIG*avg3*log10f(avgValue)) + ((1-self.LEVEL_LOWPASS_TRIG)*self.averagePowerForChannel0) ;
            print("AVG. POWER: "+averagePower.description)
            dispatch_async(dispatch_get_main_queue(), { () -> Void in
                //print("VU: "+vu.description)
                var fAvgPwr=CGFloat(averagePower)
                print("AvgPwr: "+fAvgPwr.description)

                var waveformFriendlyValue=0.5+fAvgPwr //-0.5 is AvgPwrValue when user is silent
                if(waveformFriendlyValue<0){waveformFriendlyValue=0} //round values <0 to 0
                self.waveview.hidden=false
                self.waveview.updateWithLevel(waveformFriendlyValue)
            })
        }
        audioEngine.prepare()
        try audioEngine.start()
        isNativeASRBusy=true
        nativeASRTask = nativeSpeechRecognizer?.recognitionTaskWithRequest(nativeASRRequest, delegate: self)
        nativeSpeechRecognizer?.delegate=self
  //I use this timer to track no speech timeouts, ignore if not neeeded:
        self.endOfSpeechTimeoutTimer = NSTimer.scheduledTimerWithTimeInterval(utteranceTimeoutSeconds, target: self, selector:  #selector(ViewController.stopNativeRecording), userInfo: nil, repeats: false)
    }
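To keep recognition running past that limit, a minimal restart sketch in the delegate callback mentioned above could look like this (same Swift 2 style as the code above; it assumes stopNativeRecording() tears down the tap and engine, and that a fresh SFSpeechAudioBufferRecognitionRequest is created before the next startNativeRecording() call, which is not shown here):

func speechRecognitionTask(task: SFSpeechRecognitionTask, didFinishSuccessfully successfully: Bool) {
    //Fires both on normal completion and when Apple cuts the task off near the 1-minute mark
    stopNativeRecording()
    do {
        try startNativeRecording() //re-launch with a new request/task
    } catch {
        print("Could not restart native recording: \(error)")
    }
}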
– Josh
  • It is just a parameter with a value between -1 and 1, I used a value of -0.2 to resize the microphone volume graph to fit the UI of my app. If you don't need to plot mic volume, then you can give it a value of zero, or simply take out that part of code. @MarkusRautopuro – Josh Nov 28 '16 at 10:46
  • `avgValue` is actually the maximum not the average, consider renaming it – ielyamani Mar 04 '17 at 19:49
  • what is `LEVEL_LOWPASS_TRIG` here? – aBikis Jun 03 '17 at 03:50
  • @aBikis I am not sure, I saw it on a formula online, setting Level_Lowpass_Trig to 0.01 worked for me. – Josh Jun 06 '17 at 08:50
  • If you can do a hello world on your Apple Watch, all you have to do is copy-paste that code and call the method. Yeah, there have been changes in Swift and watchOS, so you might need to fix 2 or 3 deprecated lines of code. @lya – Josh Dec 11 '17 at 13:17

If you enable on-device-only recognition, speech recognition will not automatically stop after one minute.

.requiresOnDeviceRecognition = true

More about requiresOnDeviceRecognition:

https://developer.apple.com/documentation/speech/sfspeechrecognitionrequest/3152603-requiresondevicerecognition
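A minimal sketch, assuming iOS 13 or later and a locale the device supports for on-device recognition:

import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
let request = SFSpeechAudioBufferRecognitionRequest()

if #available(iOS 13, *), recognizer?.supportsOnDeviceRecognition == true {
    // per this answer, recognition is not cut off at one minute when it runs on device
    request.requiresOnDeviceRecognition = true
}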

– Yalcin Ozdemir

This works perfectly in my app. You can send questions to saifurrahman3126@gmail.com. Apple does not allow users to recognize speech continuously for more than one minute. Check here: https://developer.apple.com/documentation/speech/sfspeechrecognizer

"Plan for a one-minute limit on audio duration. Speech recognition places a relatively high burden on battery life and network usage. To minimize this burden, the framework stops speech recognition tasks that last longer than one minute. This limit is similar to the one for keyboard-related dictation." This is what Apple says in its documentation.

For now, I make requests of 40 seconds each and then reconnect. If you speak before the 40 seconds are up and then pause, the recording will start again.

@objc  func startRecording() {
    
    self.fullsTring = ""
    audioEngine.reset()
    
    if recognitionTask != nil {
        recognitionTask?.cancel()
        recognitionTask = nil
    }
    
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.record)
        try audioSession.setMode(.measurement)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        try audioSession.setPreferredSampleRate(44100.0)
        
        if audioSession.isInputGainSettable {
            let error : NSErrorPointer = nil
            
            let success = try? audioSession.setInputGain(1.0)
            
            guard success != nil else {
                print ("audio error")
                return
            }
            if (success != nil) {
                print("\(String(describing: error))")
            }
        }
        else {
            print("Cannot set input gain")
        }
    } catch {
        print("audioSession properties weren't set because of an error.")
    }
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    
    let inputNode = audioEngine.inputNode
    guard let recognitionRequest = recognitionRequest else {
        fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
    }
    
    recognitionRequest.shouldReportPartialResults = true
    self.timer4 = Timer.scheduledTimer(timeInterval: TimeInterval(40), target: self, selector: #selector(againStartRec), userInfo: nil, repeats: false)
    
    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest, resultHandler: { (result, error ) in
        
        var isFinal = false
        
        if result != nil {
            self.timer.invalidate()
            self.timer = Timer.scheduledTimer(timeInterval: TimeInterval(2.0), target: self, selector: #selector(self.didFinishTalk), userInfo: nil, repeats: false)
            
            let bestString = result?.bestTranscription.formattedString
            self.fullsTring = bestString!
            
            self.inputContainerView.inputTextField.text = result?.bestTranscription.formattedString
            
            isFinal = result!.isFinal
            
        }
        if error == nil{
            
        }
        if  isFinal {
            
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            
            self.recognitionRequest = nil
            self.recognitionTask = nil
            isFinal = false
            
        }
        if error != nil{
            URLCache.shared.removeAllCachedResponses()
            
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            
            guard let task = self.recognitionTask else {
                return
            }
            task.cancel()
            task.finish()
        }
    })
    audioEngine.reset()
    inputNode.removeTap(onBus: 0)
    
    let recordingFormat = AVAudioFormat(standardFormatWithSampleRate: 44100, channels: 1)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
        self.recognitionRequest?.append(buffer)
    }
    
    audioEngine.prepare()
    
    do {
        try audioEngine.start()
    } catch {
        print("audioEngine couldn't start because of an error.")
    }
    
    self.hasrecorded = true
}

@objc func againStartRec(){
    
    self.inputContainerView.uploadImageView.setBackgroundImage( #imageLiteral(resourceName: "microphone") , for: .normal)
    self.inputContainerView.uploadImageView.alpha = 1.0
    self.timer4.invalidate()
    timer.invalidate()
    self.timer.invalidate()
    
    if ((self.audioEngine.isRunning)){
        
        self.audioEngine.stop()
        self.recognitionRequest?.endAudio()
        self.recognitionTask?.finish()
    }
    self.timer2 = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(startRecording), userInfo: nil, repeats: false)
}

@objc func didFinishTalk(){
    
    if self.fullsTring != ""{
        
        self.timer4.invalidate()
        self.timer.invalidate()
        self.timer2.invalidate()
        
        if ((self.audioEngine.isRunning)){
            self.audioEngine.stop()
            guard let task = self.recognitionTask else {
                return
            }
            task.cancel()
            task.finish()
        }
    }
}