How to correctly set up AVAudioSession and AVAudioEngine when using both SFSpeechRecognizer and AVSpeechSythesizer

Question

I am trying to create an app that leverages both STT (Speech to Text) and TTS (Text to Speech) at the same time. However, I am running into a couple of foggy issues and would appreciate your kind expertise.

The app consists of a button at the center of the screen which, upon clicking, starts the required speech recognition functionality using the code below.

// MARK: - Constant Properties

let audioEngine = AVAudioEngine()



// MARK: - Optional Properties

var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
var recognitionTask: SFSpeechRecognitionTask?
var speechRecognizer: SFSpeechRecognizer?



// MARK: - Functions

internal func startSpeechRecognition() {

    // Instantiate the recognitionRequest property.
    self.recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    // Set up the audio session.
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.record, mode: .measurement, options: [.defaultToSpeaker, .duckOthers])
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("An error has occurred while setting the AVAudioSession.")
    }

    // Set up the audio input tap.
    let inputNode = self.audioEngine.inputNode
    let inputNodeFormat = inputNode.outputFormat(forBus: 0)

    self.audioEngine.inputNode.installTap(onBus: 0, bufferSize: 512, format: inputNodeFormat, block: { [unowned self] buffer, time in
        self.recognitionRequest?.append(buffer)
    })

    // Start the recognition task.
    guard
        let speechRecognizer = self.speechRecognizer,
        let recognitionRequest = self.recognitionRequest else {
            fatalError("One or more properties could not be instantiated.")
    }

    self.recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest, resultHandler: { [unowned self] result, error in

        if error != nil {

            // Stop the audio engine and recognition task.
            self.stopSpeechRecognition()

        } else if let result = result {

            let bestTranscriptionString = result.bestTranscription.formattedString

            self.command = bestTranscriptionString
            print(bestTranscriptionString)

        }

    })

    // Start the audioEngine.
    do {
        try self.audioEngine.start()
    } catch {
        print("Could not start the audioEngine property.")
    }

}



internal func stopSpeechRecognition() {

    // Stop the audio engine.
    self.audioEngine.stop()
    self.audioEngine.inputNode.removeTap(onBus: 0)

    // End and deallocate the recognition request.
    self.recognitionRequest?.endAudio()
    self.recognitionRequest = nil

    // Cancel and deallocate the recognition task.
    self.recognitionTask?.cancel()
    self.recognitionTask = nil

}

When used alone, this code works like a charm. However, when I want to read that transcribed text using an AVSpeechSynthesizer object, nothing seems to be clear.

I went through the suggestions of multiple Stack Overflow posts, which suggested modifying

audioSession.setCategory(.record, mode: .measurement, options: [.defaultToSpeaker, .duckOthers])

To the following

audioSession.setCategory(.playAndRecord, mode: .default, options: [.defaultToSpeaker, .duckOthers])

Yet in vain. The app was still crashing after running STT then TTS, respectively.

The solution was for me to use this rather than the aforementioned

audioSession.setCategory(.multiRoute, mode: .default, options: [.defaultToSpeaker, .duckOthers])

This got me completely overwhelmed as I really have no clue what was intricately going on. I would highly appreciate any relevant explanation!

score 2 · Accepted Answer · answered Nov 08 '18 at 01:02

I am developing an app with both SFSpeechRecognizer and AVSpeechSythesizer too, and for me the .setCategory(.playAndRecord, mode: .default) works fine and it is the best category for our needs, according to Apple. Even, I am able to .speak() every transcription of the SFSpeechRecognitionTask while the audio engine is running without any problem. My opinion is somewhere in your programm's logic causes the crash. It would be good if you can update your question with the corresponding error.

And about why the .multiRoute category works: I guess there is a problem with the AVAudioInputNode. If you see in the console and error like this

Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: IsFormatSampleRateAndChannelCountValid(hwFormat)

or like this

Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: nullptr == Tap()

you only needs to reorder some parts of the code like moving the setup of the audio session somewhere where it only gets called once, or ensure that the tap of the input node is always removed before installing a new one even if the recognition task finish successfully or not. And maybe (I have never worked with it) the .multiRoute is able to reuse the same input node by its nature of working with different audio streams and routes.

I leave below the logic I use with my programm following Apple's WWDC session:

Setting category

override func viewDidLoad() { //or init() or when necessarily
    super.viewDidLoad()
    try? AVAudioSession.sharedInstance().setCategory(.playAndRecord, mode: .default)
}

Validations/permissions

func shouldProcessSpeechRecognition() {
    guard AVAudioSession.sharedInstance().recordPermission == .granted,
        speechRecognizerAuthorizationStatus == .authorized,
        let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else { return }
        //Continue only if we have authorization and recognizer is available

        startSpeechRecognition()
}

Starting STT

func startSpeechRecognition() {
    let format = audioEngine.inputNode.outputFormat(forBus: 0)
    audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { [unowned self] (buffer, _) in
        self.recognitionRequest.append(buffer)
    }
    audioEngine.prepare()
    do {
        try audioEngine.start()
        recognitionTask = speechRecognizer!.recognitionTask(with: recognitionRequest, resultHandler: {...}
    } catch {...}
}

Ending STT

func endSpeechRecognition() {
    recognitionTask?.finish()
    stopAudioEngine()
}

Canceling STT

func cancelSpeechRecognition() {
    recognitionTask?.cancel()
    stopAudioEngine()
}

Stoping audio engine

func stopAudioEngine() {
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest.endAudio()        
}

And with that, anywhere in my code I can call an AVSpeechSynthesizer instance and speak an utterance.

Hello and thank you for your constructive answer. Your code works perfectly fine, but have you tried adding `.defaultToSpeaker` in the `AVAudioSession.sharedInstance().setCategory(.playAndRecord, mode: .default)` method? Everything stops working when I do! — Jad Ghadry, Nov 08 '18 at 10:07
Yes, I already did it and all works fine for me. I am running it on two physical devices: iPhone 6 with 12.1 and a jailbroken iPhone 6 with 10.3.2. — Ángel Téllez, Nov 08 '18 at 18:33
Your solution finally worked. I realized that I was calling `.setCategory` I was starting my audio engine. Moving that line of code inside `viewDidLoad` was the adequate solution for me, and your explanation was flawless. Thank you for your kind help! — Jad Ghadry, Nov 08 '18 at 23:36