There isn't a great answer to this, but your best bet for offline speech recognition at the moment (August 2023) is to use an implementation of OpenAI's Whisper model, compiled to WebAssembly. There are three that I know of:
- ggerganov's whisper.cpp
- Xenova's implementation in transformers.js
- Hugging Face's implementation in Candle
Note this still isn't a great option for a few reasons:
- Download size: because it isn't built into the browser, it requires the browser to download a large model file (an absolute minimum of 31 MB for the quantized "tiny" model).
- Quality: there's a pretty direct tradeoff between model size and quality. Only the smallest quantized models are even close to reasonable to load in most webpages, and you're not going to get top-notch results from them. Even the largest models you could plausibly load into a browser (probably "small", maybe quantized "medium", or, if you're feeling brave/masochistic, quantized "large", which is over 1 GB) won't be as good as the large unquantized models that can only reasonably run on a server. And even if you do manage to load one of these models, it's going to be lacking in...
- Inference speed: the "tiny" model can just barely keep up with real-time transcription on a relatively new/powerful laptop or desktop (it doesn't quite keep up on my older X1 Carbon Gen 7 laptop). It will likely lag significantly on most mobile devices, and larger models will be even slower. This, for me, is the biggest problem. Try it out for yourself with ggerganov's stream demo.
- Complexity: getting any of these up and running in your own project is not entirely straightforward, and the APIs are generally much lower-level than the Web Speech Recognition API. For example, the core of the transformers.js implementation, which seems to be the simplest, is over 100 lines of code, and that just handles pre-recorded files, not real-time transcription (a minimal sketch of the simple case follows this list).
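To give a sense of what "lower-level" means in practice, here's a minimal sketch of the simplest case, transcribing a pre-recorded file with transformers.js. The model name, URL, and option values are just plausible placeholders, not taken from any of the demos above:

```ts
import { pipeline } from '@xenova/transformers';

// Create an automatic-speech-recognition pipeline backed by a quantized Whisper model.
// The model file is downloaded and cached by the browser on first use
// (this is where the tens-of-megabytes download happens).
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en',
);

// Transcribe a pre-recorded file (a URL or a Float32Array of audio samples).
// chunk_length_s / stride_length_s control how longer audio is split and overlapped.
const output = await transcriber('https://example.com/clip.wav', {
  chunk_length_s: 30,
  stride_length_s: 5,
});

console.log(output); // { text: '...' }
```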
Part of the added complexity comes from the fact that these models work on fixed-length chunks of audio (30-second windows, in Whisper's case). For longer audio files, and especially for real-time transcription, we want a continuous stream of audio to produce a continuous stream of output text. The Web Speech Recognition API handles that for you; with Whisper you have to do the chunking yourself (and deal with things like window overlaps and correcting transcriptions of already-seen words).
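For comparison, here's roughly what continuous transcription looks like with the (currently online-only) Web Speech Recognition API, where the browser owns the microphone stream, does the chunking, and revises its interim guesses for you:

```ts
// The Web Speech Recognition API, for comparison.
// Chrome still exposes this under the webkit prefix.
const SpeechRecognition =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.continuous = true;      // keep transcribing past the first utterance
recognition.interimResults = true;  // emit provisional results that may be corrected later

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const { transcript } = event.results[i][0];
    console.log(event.results[i].isFinal ? 'final:' : 'interim:', transcript);
  }
};

recognition.start();
```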
There is a good description of some of these issues with using lower-level speech recognition model APIs in the README of Google's Open Source Live Transcribe Speech Engine.[1]
All that to say, it would be really nice if we could just use the Web Speech Recognition API itself with an offline, browser-native model, but I haven't seen any recent movement in that direction.[2][3] If you can accept the limitations, Whisper might be a workable alternative. And if you want to make a Web Speech API polyfill on top of it (a rough sketch of the shape is below), I'm sure it would be very much appreciated!
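If anyone does attempt that polyfill, the rough shape would be something like the sketch below. WhisperSpeechRecognition and transcribeChunk are hypothetical names, and all of the real work (mic capture, buffering into Whisper-sized windows, overlap handling, merging corrected text) is left as comments:

```ts
// Hypothetical skeleton of a SpeechRecognition polyfill backed by a Whisper WASM port.
// transcribeChunk() is a stand-in for whichever implementation you choose.
declare function transcribeChunk(samples: Float32Array): Promise<string>;

class WhisperSpeechRecognition {
  continuous = false;
  interimResults = false;
  onresult: ((event: { transcript: string; isFinal: boolean }) => void) | null = null;

  private stream: MediaStream | null = null;

  async start(): Promise<void> {
    // Unlike the native API, you own the whole pipeline: capture the mic,
    // buffer samples into Whisper-sized chunks (with overlap), run inference,
    // and merge/correct the outputs before surfacing them via onresult.
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    // ... feed samples (e.g. via an AudioWorklet) into transcribeChunk() here
  }

  stop(): void {
    this.stream?.getTracks().forEach((track) => track.stop());
  }
}
```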
[1]: In the announcement post for that library, Google recognized the complications of relying on an online system. Unfortunately, despite the name, this project isn't really what I'd call a "Live Transcribe Speech Engine", but rather a library for doing live transcription using Google's cloud transcription API.
[2]: In fact, Chrome does ship a library for offline transcription called libSODA (Speech On-Device API), but it was initially released for the Live Caption feature and still doesn't seem to be used for user-facing voice-to-text. Not so surprisingly, "the Speech team was concerned about unauthorized repurposing of their components", so general availability for speech-to-text usage isn't something we can expect in the near future.
[3]: At one point Mozilla was building a speech-to-text engine called DeepSpeech to embed in Firefox, but apparently dropped development. Some former members of the DeepSpeech team forked the project and continued the work for a while as Coqui AI STT, but they have since retired that effort and recommend using Whisper instead.