I am working on an iOS project that uses SFSpeechRecognizer. It works fine at first: I speak some words and it responds. But after a minute or two it simply fails and stops giving any feedback about recognized results. I wonder if that's related to the buffer, but I don't know how to fix it.

I basically built the project on top of the SpeechRecognizer demo. The difference is that I store the recognized result word by word in an array. The program analyzes the array and responds to certain preset words like "play" and other commands. After the program responds to a command, it removes that element from the array.

Talk is cheap; here is the code:

  1. The recognizer. You can see the supportedCommands array, which filters the specific words the program responds to. The other parts are similar to the demo at https://developer.apple.com/library/content/samplecode/SpeakToMe/Listings/SpeakToMe_ViewController_swift.html#//apple_ref/doc/uid/TP40017110-SpeakToMe_ViewController_swift-DontLinkElementID_6

    class SpeechRecognizer: NSObject, SFSpeechRecognizerDelegate {
    
        private var speechRecognizer: SFSpeechRecognizer!
        private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest!
        private var recognitionTask: SFSpeechRecognitionTask!
        private let audioEngine = AVAudioEngine()
        private let locale = Locale(identifier: "en-US")
    
        private var lastSavedString: String = ""
        private let supportedCommands = ["more", "play"]
    
        var speechInputQueue: [String] = [String]()
    
        func load() {
            print("load")
            prepareRecognizer(locale: locale)
    
            authorize()
        }
    
        func start() {
            print("start")
            if !audioEngine.isRunning {
                try! startRecording()
            }
        }
    
        func stop() {
            if audioEngine.isRunning {
                audioEngine.stop()
                recognitionRequest?.endAudio()
    
            }
        }
    
        private func authorize() {
            SFSpeechRecognizer.requestAuthorization { authStatus in
                OperationQueue.main.addOperation {
                    switch authStatus {
                    case .authorized:
                        print("Authorized!")
                    case .denied:
                        print("Unauthorized!")
                    case .restricted:
                        print("Unauthorized!")
                    case .notDetermined:
                        print("Unauthorized!")
                    }
                }
            }
        }
    
        private func prepareRecognizer(locale: Locale) {
            speechRecognizer = SFSpeechRecognizer(locale: locale)!
            speechRecognizer.delegate = self
        }
    
        private func startRecording() throws {
    
            // Cancel the previous task if it's running.
            if let recognitionTask = recognitionTask {
                recognitionTask.cancel()
                self.recognitionTask = nil
            }
    
            let audioSession = AVAudioSession.sharedInstance()
            try audioSession.setCategory(AVAudioSessionCategoryPlayAndRecord, with: .defaultToSpeaker)
            try audioSession.setMode(AVAudioSessionModeDefault)
            try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
    
            recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    
            let inputNode = audioEngine.inputNode
            guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object") }
    
            // Configure request so that results are returned before audio recording is finished
            recognitionRequest.shouldReportPartialResults = true
    
            // A recognition task represents a speech recognition session.
            // We keep a reference to the task so that it can be cancelled.
            recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
                var isFinal = false
    
                if let result = result {
    
                    let temp = result.bestTranscription.formattedString.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines).lowercased()
                    //print("temp", temp)
                    if temp != self.lastSavedString && temp.count > self.lastSavedString.count {
    
                        var tempSplit = temp.split(separator: " ")
                        var lastSplit = self.lastSavedString.split(separator: " ")
                        // Drop the common prefix that was already processed.
                        // Also check tempSplit so tempSplit[0] cannot go out
                        // of bounds when the transcription is rewritten.
                        while lastSplit.count > 0 && tempSplit.count > 0 {
                            if String(tempSplit[0]) == String(lastSplit[0]) {
                                tempSplit.remove(at: 0)
                                lastSplit.remove(at: 0)
                            }
                            else {
                                break
                            }
                        }
    
                        for command in tempSplit {
                            if self.supportedCommands.contains(String(command)) {
                                self.speechInputQueue.append(String(command))
                            }
                        }
                        self.lastSavedString = temp
    
                    }
                    isFinal = result.isFinal
                }
    
                if error != nil || isFinal {
                    self.audioEngine.stop()
                    inputNode.removeTap(onBus: 0)
                    self.recognitionRequest = nil
                    self.recognitionTask = nil
                }
            }
    
            let recordingFormat = inputNode.outputFormat(forBus: 0)
            inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
                self.recognitionRequest?.append(buffer)
            }
    
            audioEngine.prepare()
    
            try audioEngine.start()
    
        }
    }
    
  2. How we use it:

        if self.speechRecognizer.speechInputQueue.count > 0 {
            if self.speechRecognizer.speechInputQueue[0] == "more" {
                print("temp", temp)
                print("content", content)
                // isSpeakingContent = true
                self.textToSpeech(text: content)
            }
            else if self.speechRecognizer.speechInputQueue[0] == "play" {
                print("try to play")
                let soundURL = URL(fileURLWithPath: Bundle.main.path(forResource: "cascade", ofType: "wav")!)

                do {
                    // Only prepare and play if the player was actually created;
                    // calling play() after a failed init would crash.
                    audioPlayer = try AVAudioPlayer(contentsOf: soundURL)
                    audioPlayer.prepareToPlay()
                    audioPlayer.play()
                }
                catch {
                    print(error)
                }
            }
            else {
                self.textToSpeech(text: "unrecognized command")
            }
            self.speechRecognizer.speechInputQueue.remove(at: 0)
            print("after :", self.speechRecognizer.speechInputQueue)
        }
    

It responds to certain commands and plays some audio.
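In case it matters, the block above runs inside a periodic poll, roughly like this (simplified; commandTimer and the 0.5 second interval stand in for the actual driver in my view controller):

    // Simplified driver: poll the command queue a few times per second.
    // `commandTimer` and the interval are placeholders, not the real code.
    var commandTimer: Timer?

    func startPollingCommands() {
        commandTimer = Timer.scheduledTimer(withTimeInterval: 0.5, repeats: true) { [weak self] _ in
            guard let self = self else { return }
            if self.speechRecognizer.speechInputQueue.count > 0 {
                // ... handle and remove the first queued command, as shown above ...
            }
        }
    }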

Is there any problem with the buffer? Maybe after one or two minutes of recognition the buffer is full? The recognizer just fails over time.


1 Answer


From WWDC 2016 Session 509: Speech Recognition API:

For iOS 10 we're starting with a strict audio duration limit of about one minute which is similar to that of keyboard dictation.
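So the recognizer is not hitting a full buffer; the session itself is capped at roughly one minute. A common workaround is to stop and restart the recognition task before the cap is reached. Here is a minimal sketch of that idea against the SpeechRecognizer class in the question; restartTimer and the 50 second interval are illustrative, and the handoff between the old task's completion handler and the restart is simplified:

    // Illustrative workaround sketch: restart recognition before the
    // ~1 minute cap is hit. Meant to live inside the question's
    // SpeechRecognizer class; `restartTimer` is a sketch name, not an API.
    private var restartTimer: Timer?

    func startWithAutoRestart() {
        start()
        restartTimer = Timer.scheduledTimer(withTimeInterval: 50, repeats: true) { [weak self] _ in
            guard let self = self else { return }
            self.stop()                                     // end the current audio
            self.audioEngine.inputNode.removeTap(onBus: 0)  // drop the old tap before re-tapping
            self.lastSavedString = ""                       // the new task transcribes from scratch
            try? self.startRecording()                      // begin a fresh session under the limit
        }
    }

Invalidate restartTimer whenever you want recognition to stop for good.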

rob mayoff
  • What about when recognizing speech from an audio file? Can I extend the duration without limit? – daniel Sep 02 '23 at 13:01
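  (Regarding the comment above: recognizing a pre-recorded file goes through SFSpeechURLRecognitionRequest rather than the buffer-based request; a minimal sketch follows, with fileURL as a placeholder. Whether the one-minute cap applies to file requests is not addressed by the session quote.)

    import Speech

    // Minimal sketch for file-based recognition: SFSpeechURLRecognitionRequest
    // reads from a URL instead of streamed buffers. `fileURL` is a placeholder.
    func recognizeFile(at fileURL: URL) {
        guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")) else { return }
        let request = SFSpeechURLRecognitionRequest(url: fileURL)
        // In real code, keep a reference to the returned task so it can be cancelled.
        _ = recognizer.recognitionTask(with: request) { result, error in
            if let result = result, result.isFinal {
                print(result.bestTranscription.formattedString)
            }
            if let error = error {
                print(error)
            }
        }
    }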