
In my iOS app, I am trying to transcribe prerecorded audio using iOS 10's latest feature, the Speech API.

Multiple sources including the documentation have stated that the audio duration limit for the Speech API (more specifically SFSpeechRecognizer) is 1 minute.

In my code, I have found that any audio file about 15 seconds or longer fails with the following error:

    Error Domain=kAFAssistantErrorDomain Code=203 "SessionId=com.siri.cortex.ace.speech.session.event.SpeechSessionId@50a8e246, Message=Timeout waiting for command after 30000 ms" UserInfo={NSLocalizedDescription=SessionId=com.siri.cortex.ace.speech.session.event.SpeechSessionId@50a8e246, Message=Timeout waiting for command after 30000 ms, NSUnderlyingError=0x170248c40 {Error Domain=SiriSpeechErrorDomain Code=100 "(null)"}}

I have searched all over the internet and have not been able to find a solution. Other people have run into the same problem; some suspect it is an issue with Nuance.

It is also worth noting that I do get partial results from the transcription process.

Here's the code from my iOS app:

    // Create a speech recognizer request object.
    let srRequest = SFSpeechURLRecognitionRequest(url: location)
    srRequest.shouldReportPartialResults = false

    sr?.recognitionTask(with: srRequest) { (result, error) in
        if let error = error {
            // Something went wrong.
            print(error.localizedDescription)
        } else {
            if let result = result {
                print(4)
                print(result.bestTranscription.formattedString)
                if result.isFinal {
                    print(5)
                    transcript = result.bestTranscription.formattedString
                    print(result.bestTranscription.formattedString)

                    // Store the transcript into the database.
                    print("\nSiri-Transcript: " + transcript!)

                    // Store the audio transcript into the Firebase Realtime Database.
                    self.firebaseRef = FIRDatabase.database().reference()

                    let ud = UserDefaults.standard
                    if let uid = ud.string(forKey: "uid") {
                        print("Storing the transcript into the database.")
                        let path = "users" + "/" + uid + "/" + "siri_transcripts" + "/" + date_recorded + "/" + filename.components(separatedBy: ".")[0]
                        print("transcript database path: \(path)")
                        self.firebaseRef.child(path).setValue(transcript)
                    }
                }
            }
        }
    }

Thank you for your help.

itsSLO

3 Answers


I haven't confirmed this beyond someone else running into the same problem, but I believe it is an undocumented limit on prerecorded audio.

itsSLO

Remove the result.isFinal check and do a nil check on the result instead. Reference: https://github.com/mssodhi/Jarvis-ios/blob/master/Jarvis-ios/HomeCell%2Bspeech.swift
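
A minimal sketch of what that change could look like, reusing the sr recognizer and srRequest names from the question (those names are assumed from there, not defined here):

    sr?.recognitionTask(with: srRequest) { (result, error) in
        if let error = error {
            print(error.localizedDescription)
            return
        }
        // Nil-check the result instead of gating on result.isFinal,
        // which may never become true for longer prerecorded files.
        if let result = result {
            let transcript = result.bestTranscription.formattedString
            print("Transcript so far: \(transcript)")
        }
    }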

Manu Sodhi
  • Thank you for your response. Unfortunately, the problem still exists. I think it has to do with Apple changing the limits on audio file length without updating the documentation. It could also be an undocumented limit on prerecorded audio. – itsSLO Jul 22 '17 at 22:54
  • @itsSLO have you tried with live audio feed i.e. using mic instead of pre recorded file? – Manu Sodhi Jul 27 '17 at 20:22
  • I have not, because I don't have a use case for that at the moment. That said, I do believe that a live audio feed will have a higher limit than prerecorded audio. – itsSLO Jul 28 '17 at 22:10

This is true. I extracted the audio file from a video, and if it exceeds 15 seconds it gives the following error:

Domain = kAFAssistantErrorDomain Code = 203 "Timeout" UserInfo = {
    NSLocalizedDescription = Timeout,
    NSUnderlyingError = 0x1c0647950 {Error Domain=SiriSpeechErrorDomain Code=100 "(null)"}
}

The key issue is recognizing audio files longer than 15 seconds: result.isFinal is always false. What is frustrating is that there is no accurate timestamp; even though the error says "Timeout", the result contains the complete recognition content, which seems strange.

If you iterate over the result and print its segments (see the sketch below), you can see the roughly 15-second restriction: the timestamp feedback for the audio file stops after a limited number of segments, such as 15, 4, or 9, and then recognition ends. Exactly when the Timeout is reported is rather unstable.
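
For example, a rough sketch (mine, not from the original post) of printing the segments inside the recognition task callback, assuming result is the SFSpeechRecognitionResult delivered to that callback:

    // Inside the recognitionTask callback: print each recognized segment
    // with its timestamp and duration, both in seconds from the start of the audio.
    for segment in result.bestTranscription.segments {
        print("\(segment.substring) @ \(segment.timestamp)s for \(segment.duration)s")
    }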

With real-time speech recognition, however, you can get past 15 seconds, up to the one-minute limit described in the official documentation.
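
For comparison, a rough sketch of the live, buffer-based setup; this is illustrative only and assumes microphone and speech-recognition permissions have already been granted:

    import Speech
    import AVFoundation

    // In a real app these would be properties of a controller.
    let recognizer = SFSpeechRecognizer()
    let audioEngine = AVAudioEngine()
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.shouldReportPartialResults = true

    // Feed microphone buffers to the recognizer as they arrive.
    let inputNode = audioEngine.inputNode
    let format = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        request.append(buffer)
    }

    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        print("Audio engine failed to start: \(error)")
    }

    recognizer?.recognitionTask(with: request) { result, error in
        if let result = result {
            print(result.bestTranscription.formattedString)
        }
    }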

Achraf Almouloudi
朱耀宇