
I am an Android developer living with hearing impairment, and I am currently exploring the option of making a speech-to-text app with the SpeechRecognizer API in Android. Closed-captioning telephones and InnoCaption are not available in my home country. A potential application would be captioning during telephone calls.

https://developer.android.com/reference/android/speech/SpeechRecognizer.html

The API is meant for capturing short voice commands, not for real-time live transcription. I am able to implement it as a service, but I constantly need to restart it after it has delivered a result or a partial result, which is not workable in a conversational setting (words get lost while the service is restarting).
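
For reference, here is a minimal sketch of the restart pattern I am describing (the service wiring is illustrative, not my exact code):

```java
import android.app.Service;
import android.content.Intent;
import android.os.Bundle;
import android.os.IBinder;
import android.speech.RecognitionListener;
import android.speech.RecognizerIntent;
import android.speech.SpeechRecognizer;

import java.util.ArrayList;

// A Service that restarts SpeechRecognizer after every result.
// Words spoken during the teardown/startup gap are lost.
public class TranscriptionService extends Service implements RecognitionListener {

    private SpeechRecognizer recognizer;
    private Intent recognizerIntent;

    @Override
    public void onCreate() {
        super.onCreate();
        recognizer = SpeechRecognizer.createSpeechRecognizer(this);
        recognizer.setRecognitionListener(this);
        recognizerIntent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        recognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        recognizerIntent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true);
        recognizer.startListening(recognizerIntent);
    }

    @Override
    public void onPartialResults(Bundle partialResults) {
        ArrayList<String> texts = partialResults
                .getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
        // Push texts to the UI (e.g. broadcast to an Activity's TextView).
    }

    @Override
    public void onResults(Bundle results) {
        // The session has ended; restart immediately. This gap is where
        // words get lost in a live conversation.
        recognizer.startListening(recognizerIntent);
    }

    @Override
    public void onError(int error) {
        // Errors such as ERROR_SPEECH_TIMEOUT also end the session.
        recognizer.startListening(recognizerIntent);
    }

    // Remaining callbacks are not needed for this sketch.
    @Override public void onReadyForSpeech(Bundle params) {}
    @Override public void onBeginningOfSpeech() {}
    @Override public void onRmsChanged(float rmsdB) {}
    @Override public void onBufferReceived(byte[] buffer) {}
    @Override public void onEndOfSpeech() {}
    @Override public void onEvent(int eventType, Bundle params) {}

    @Override public IBinder onBind(Intent intent) { return null; }

    @Override
    public void onDestroy() {
        recognizer.destroy();
        super.onDestroy();
    }
}
```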

Do note that I don't need 100% accuracy for this app. Many hearing-impaired people find it helpful to have some context of the conversation to help them along, so I don't need comments about how this is not going to be accurate.

Is there a way to implement SpeechRecognizer in a continuous mode? I can create a TextView that constantly updates itself whenever new text is returned from the service. If this API is not what I should be looking at, is there any recommendation? I tested CMUSphinx, but found it so dependent on blocks of phrases/sentences that it is unlikely to work for the kind of application I have in mind.

TylerH
Lorteld

3 Answers


I am a deaf software developer, so I can chime in. I've been monitoring the state of the art of speech-to-text APIs, and they have now become "good enough" to provide operatorless relay/captioning services for CERTAIN kinds of phone conversations with people using the telephone in quiet settings. For example, I get 98% transcription accuracy of my spouse's voice with Apple Siri's real-time transcription (iOS 8).

I was able to jerry-rig phone captioning by routing the sound out of one phone into a second iPhone, on which I pressed the microphone button (on the pop-up keyboard), and successfully captioned a telephone conversation with ~95% accuracy at 250 words per minute (faster than Sprint Captioned Telephone and Hamilton Captioned Telephone), at least until the one-minute cutoff.

Thus, I declare computer-based voice recognition practical for phone calls with family members (the kind you call frequently, in quiet environments), where you can at least coach them to move to a quiet place so that captioning works properly (with >95% accuracy). Since iOS 8 was released, we REALLY need this, so we don't have to rely on relay operators or captioned telephones. Sprint Captioned Telephone lags badly during fast speech, while Apple Siri keeps up, so I can conduct more natural telephone conversations with my jerry-rigged two-iOS-device Apple Siri "real-time captioned telephone" setup.

Some cellphones transmit audio in a higher-definition manner, so it works well between two iPhones (one iPhone's speaker piped into another iPhone's Siri running in iOS 8 continuous mode). That assumes you're on G.722.2 (AMR-WB), e.g. when running two iPhones on the same carrier that supports the high-definition audio telephony standard. It works perfectly when piped through Siri -- roughly as good as speaking in front of the phone, for the same human voice (assuming the other end is speaking into the phone in a quiet environment).

Google and Apple need to open up their speech-to-text APIs to assistive applications, pronto, because operatorless telephone transcription is finally practical, at least when calling family members (good voices, coached to be in a quiet environment when receiving the call). The continuous-recognition time limit also needs to be removed in this situation.

Mark Rejhon
  • Update to self: Two great apps now exist that can reliably transcribe. These include [Otter Transcriber](https://otter.ai/login) and [Google Live Transcribe](https://play.google.com/store/apps/details?id=com.google.audio.hearing.visualization.accessibility.scribe&hl=en_CA) -- you just press a button and it will automatically transcribe a multi-speaker dinner-table or conference-room conversation. It has been a revolutionary change for me as a deaf software developer. – Mark Rejhon Jun 23 '19 at 18:22
  • I'm just a developer looking to add speech recognition to my website, but I concur that Otter Transcriber is a good option. However, I notice substantially different accuracy levels between using it on my desktop computer and on my phone. On my desktop, the speech recognition is great -- on par with Google's recognition. On my phone, it's a lot worse. I'd attribute it to my phone's mic just being bad; however, the Google app does better on my phone, so it can't just be that. EDIT: Nvm, I just had my phone's mic volume set low -- I didn't know the volume buttons controlled that while recording! – Venryx Jul 25 '19 at 02:14
  • Get a dual-mic phone. The new iPads are an amazing improvement in accuracy compared to the older single-mic iPads, with a bigger screen than phones. External Bluetooth mic options also exist to help. Eventually an API will probably exist, but these apps are great; you can also mirror Otter concurrently in multiple web browsers, so once someone starts transcribing, the captions can in theory show up in an IFRAME, if you get the caption-stream-specific link (use the share button after you start recording). – Mark Rejhon Jul 26 '19 at 14:31

Google's recognizer is not going to work with telephone-quality audio anyway; you will need to build the captioning service yourself using CMUSphinx.

You probably didn't configure CMUSphinx properly. It should be fine for large-vocabulary transcription; the only thing you need to take care of is to use the telephony 8 kHz acoustic model (not the wideband model) together with the generic language model.
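
For illustration, a setup along those lines with pocketsphinx-android might look like the sketch below (the model file names match the downloads linked in the comments; the directory layout is an assumption about where you unpack them):

```java
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

import java.io.File;
import java.io.IOException;

// Sketch: point PocketSphinx at the telephony 8 kHz acoustic model and the
// generic US English language model instead of the wideband defaults.
public class SphinxSetup {

    private static final String SEARCH_NAME = "telephony"; // arbitrary label

    public static SpeechRecognizer create(File modelsDir) throws IOException {
        SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                // Unpacked contents of en-us-8khz.tar.gz
                .setAcousticModel(new File(modelsDir, "en-us-8khz"))
                .setDictionary(new File(modelsDir, "cmudict-en-us.dict"))
                .getRecognizer();

        // Generic large-vocabulary language model
        recognizer.addNgramSearch(SEARCH_NAME,
                new File(modelsDir, "cmusphinx-5.0-en-us.lm.dmp"));
        return recognizer;
    }
}
```

You would then attach a listener and call recognizer.startListening("telephony"), the same way as in the pocketsphinx-android demo.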

For the best accuracy it is probably worth moving the processing to a server: you can set up a PBX to take the calls and transcribe the audio there, instead of trying to do everything on a limited device.
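
As a rough sketch of that server-side option with sphinx4 (the model paths are placeholders, and the input is assumed to be 8 kHz mono PCM captured from the PBX):

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.FileInputStream;
import java.io.InputStream;

// Sketch: transcribe call audio on the server with sphinx4.
public class ServerTranscriber {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Placeholder paths; point them at the 8 kHz telephony models.
        configuration.setAcousticModelPath("file:models/en-us-8khz");
        configuration.setDictionaryPath("file:models/cmudict-en-us.dict");
        configuration.setLanguageModelPath("file:models/cmusphinx-5.0-en-us.lm.dmp");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // Raw PCM from the PBX; the sample rate must match the 8 kHz model.
        InputStream stream = new FileInputStream(args[0]);
        recognizer.startRecognition(stream);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
```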

Nikolay Shmyrev
  • Interesting. So I need to change the acoustic model of CMUSphinx to the telephony 8 kHz one? Will try it and report back. – Lorteld Sep 12 '14 at 08:03
  • Try this http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/en-us-8khz.tar.gz/download and this http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Language%20Model/cmusphinx-5.0-en-us.lm.dmp/download – Nikolay Shmyrev Sep 12 '14 at 08:08
  • Thank you for the comment. I tried using the generic acoustic model as well as the 8 kHz model, and the recognition is poor for general speech. I am now looking into using sphinx4 on a server. – Lorteld Sep 23 '14 at 09:15
  • OK. For advice on accuracy, you can share the audio and the details of the decoder setup you are using; that would make it easier to figure out what is going on. – Nikolay Shmyrev Sep 23 '14 at 09:49
  • Actually, Google and Apple speech transcription works well enough for family members speaking in a quiet room -- some cellphones transmit audio in a higher-definition manner (>8 kHz), so it works well between two iPhones (one iPhone's speaker piped into another iPhone's Siri running in iOS 8 continuous mode). – Mark Rejhon Oct 10 '14 at 18:58

It is true that the SpeechRecognizer API documentation claims that

> The implementation of this API is likely to stream audio to remote servers to perform speech recognition. As such this API is not intended to be used for continuous recognition, which would consume a significant amount of battery and bandwidth.

This bit of text was added a year ago (https://android.googlesource.com/platform/frameworks/base/+/2921cee3048f7e64ba6645d50a1c1705ef9658f8). However, no changes were made to the API at the time; it remained the same. Also, I don't really see anything specific to networking or battery drain in the API itself. So, go ahead and implement a recognizer (maybe based on CMUSphinx) and make it accessible via this API.
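
For example, a custom engine can be plugged in by extending android.speech.RecognitionService; a bare skeleton might look like this (the CMUSphinx wiring is left out, and the hypothesis forwarding is only illustrative):

```java
import android.content.Intent;
import android.os.Bundle;
import android.os.RemoteException;
import android.speech.RecognitionService;
import android.speech.SpeechRecognizer;

import java.util.ArrayList;

// Skeleton of a recognizer exposed through the standard SpeechRecognizer API.
// A local engine (e.g. CMUSphinx) would be started in onStartListening and
// its hypotheses forwarded through the Callback.
public class LocalRecognitionService extends RecognitionService {

    @Override
    protected void onStartListening(Intent recognizerIntent, Callback listener) {
        try {
            listener.readyForSpeech(new Bundle());

            // Placeholder: feed microphone audio into the engine here and
            // forward each hypothesis as it arrives.
            ArrayList<String> hypotheses = new ArrayList<String>();
            hypotheses.add("hypothesis from the local engine");
            Bundle partial = new Bundle();
            partial.putStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION, hypotheses);
            listener.partialResults(partial);
        } catch (RemoteException e) {
            // The client disconnected; shut the engine down.
        }
    }

    @Override
    protected void onStopListening(Callback listener) {
        // Stop capturing audio; deliver final text via listener.results(...).
    }

    @Override
    protected void onCancel(Callback listener) {
        // Abort decoding and discard pending results.
    }
}
```

The service also has to be declared in the manifest with an intent filter for the android.speech.RecognitionService action so that clients can select it as their recognizer.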

Kaarel
  • The battery argument is moot. Phone calls also use lots of battery power anyway, so why not let high-quality server-assisted real-time transcription use a similar amount of power? – Mark Rejhon Oct 10 '14 at 20:27