It still surprises me a little, but Google's default speech recognition appears to ignore music and ambient noise entirely. The problem is that, for my use case, I want it to actually try to transcribe the music!

I'm using the Web Speech API in Chrome 72 with the demo they have.

  • I can't get it to pick up any words from music at all, even when I place the speaker next to the mic.

  • I also can't get it to pick up any YouTube videos or other videos playing online.

  • It also doesn't pick up anything my Alexa says.

  • I have an Android phone, so I'm assuming they're doing something similar to what Amazon does in commercials: playing an inaudible sound that is used to cancel out the recording? Is there any way to disable this?

  • It also doesn't work if I play music from my Mac or PC directly.

  • However, it DOES transcribe when I video chat with someone (using WebRTC, if that matters) and what they say is played through the speakers.

For anyone wondering, I want it to transcribe a video playing on the same page, in which a human speaks with no background music. I'm using their demo code to see if this is viable.

Is there any way to recognize these sounds?

To clarify, I'm asking specifically how to disable this for the Web Speech API and not in general for speech recognition.

The Web Speech API is a very specific way to request speech recognition from the browser itself (in Chrome it goes to Google, in Firefox I believe they have a native solution).

There's more info on it here: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API but the documentation is thin because behaviour varies across browsers, and I am specifically asking how to avoid this in Chrome.

E_net4
cuuupid
  • Are you using the built-in `webkitSpeechRecognition`? _"It also doesn't pick up anything my Alexa says."_ What do you mean by _"my Alexa"_? – guest271314 Feb 11 '19 at 22:18
  • This question is not "too broad". – guest271314 Feb 11 '19 at 22:19
  • @guest271314 yes I am using the builtin with `const SR = window.SpeechRecognition || window.webkitSpeechRecognition` to support as many browsers as possible – cuuupid Feb 11 '19 at 22:21
  • Note that `webkitSpeechRecognition` records the audio input to the microphone and sends that data to a remote service. The actual code that performs the speech recognition is not shipped with Chromium or Chrome source code. – guest271314 Feb 11 '19 at 22:23
  • Yes, so it seems that I'm not getting this behaviour with Firefox as much, but the goal is to get it to take into account videos/music on Chrome. I've been unable to find any docs on Chrome's implementation, are there any options I can pass to the request to signify that ambient noise shouldn't be ignored? I know in Python when using Google's SR you can pass `.ignore_ambient_noise` to do the opposite – cuuupid Feb 11 '19 at 22:27
  • No, the W3C Web Speech API specification does not provide a default means to process music. Developers have no control over how the captured audio is processed by the remote service or the transcript returned from the remote service. The fact that user biometric data is recorded and sent to a remote service is not documented outside of a bug report. You might be interested in the open source projects TensorFlow and CMU PocketSphinx. – guest271314 Feb 11 '19 at 22:28
  • I see, thank you, I'll pursue a more native solution then. – cuuupid Feb 11 '19 at 22:31
  • _"I'm a little surprised that this has already been downvoted and marked as too broad; to clarify I'm asking specifically how to disable this for the Web Speech API and not in general for speech recognition."_ Do not worry about that. The question is not "too broad". The W3C Speech API specification has been published for a fair amount of time, and has been implemented in Chromium/Chrome for some time as well. The votes to close this question citing the reason "too broad" make no sense. – guest271314 Feb 11 '19 at 22:51
  • Is it only for your own use or should it work for anyone visiting your website? By routing audio-out to the default audio-in with something like [Loopback](https://rogueamoeba.com/loopback/) digital devices, it succeeded in transcribing a TEDx video from YT, but not a song. – Kaiido Feb 12 '19 at 08:37
  • Unfortunately it must work for anyone using the site! Upon doing some more research it seems Google's speech recognition model is trained to ignore ambient noise to begin with so the only way to achieve this currently is using a custom model. The question's been closed as too broad already but the accepted answer is definitely the correct answer. – cuuupid Feb 12 '19 at 20:09

1 Answer

Note that webkitSpeechRecognition records the audio input to the microphone and sends that data to a remote service. The actual code that performs the speech recognition is not shipped with Chromium source code (which Chrome is built from).

The W3C Web Speech API specification does not provide a default means to process ambient noise/music. In Chromium/Chrome, developers have no control over how the captured audio is processed by the remote service, or over the transcript that the service returns. The fact that user biometric data is recorded and sent to a remote service is not documented outside of at least one Chromium bug report marked WON'T FIX and issues filed on GitHub.
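
To make the point about limited control concrete, here is a minimal sketch of roughly everything a page can configure on a recognizer. The properties `lang`, `continuous`, `interimResults` and `maxAlternatives` and the `onresult` handler come from the Web Speech API; the helper names `getRecognitionCtor` and `createRecognizer`, and the default values, are my own assumptions for illustration. Note that none of these settings affects how the remote service treats music or ambient noise.

```javascript
// Sketch only: the helper names below are hypothetical, not part of the Web Speech API.

// Return the SpeechRecognition constructor from a global-like object, or null if absent.
function getRecognitionCtor(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// Settings a page can apply to a recognizer (values here are assumed defaults).
// There is no setting for noise or music handling.
const DEFAULT_OPTIONS = {
  lang: 'en-US',
  continuous: true,      // keep listening after the first result
  interimResults: true,  // deliver partial transcripts as they arrive
  maxAlternatives: 1,
};

// Build a configured recognizer, or return null when the API is unavailable.
function createRecognizer(globalObj, options = {}, onTranscript = () => {}) {
  const Ctor = getRecognitionCtor(globalObj);
  if (!Ctor) return null;

  const recognition = new Ctor();
  Object.assign(recognition, DEFAULT_OPTIONS, options);

  // Forward the top alternative of each new result to the caller.
  recognition.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      onTranscript(result[0].transcript, result.isFinal);
    }
  };
  return recognition;
}

// In a browser page:
// const recognition = createRecognizer(window, { lang: 'en-US' }, console.log);
// if (recognition) recognition.start();
```

Whatever the microphone captures is handed to the browser's recognition backend as-is, and the page only sees the transcripts that come back through `onresult`.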

You might be interested in the open source projects TensorFlow and CMU PocketSphinx, where you can create your own models. Mozilla's Common Voice project (voice-web) contains a substantial amount of data that can be used for training TTS/STT models.

guest271314