
I'd like to know what the best solution would be for my problem.

We are currently looking to do keyword spotting without using Speech to Text / Keyword recognition due to accents and dialects.

We would like to listen to sound files that can be quite long, and then run them against a list of keywords to determine whether those keywords occur. We can also do model training for those keywords, so that the models potentially fit our accents.

What would the best solution to this be? My boss's idea is to look for similarity in spectrograms, but I'm just not sure what the most effective way to approach this would be.

We mainly work in C# but are willing to use any language to best solve our issue.

I tried using PocketSphinx but could not get it working properly; it still seems to do Speech to Text, which won't work well because our country has 11 languages, each with different accents.

  • Stack Overflow's scope is limited to _narrow, specific_ questions; very broad "where do I start?" questions don't generally fit the format. – Charles Duffy Aug 25 '23 at 15:09

2 Answers


First up, I'm going to make a couple of assumptions:

  • The only country I'm aware of that has 11 official languages is South Africa. If that's your use case, then you're looking at trying to do keyword spotting not just across different languages (Afrikaans, English, Setswana, Kiswahili, Xhosa, isiZulu) but across language families (Bantu, Khoisan, Indo-European).

  • The reason you don't want to use Speech to Text / keyword recognition is that most such models are single-language - and you want to be able to spot keywords across all eleven of your languages.

  • There is a hidden requirement here that the keyword spotting has to identify words in many languages, dialects and accents.

The way I would approach this problem is to train a keyword spotting model using data from all eleven languages. The model doesn't care which languages are in the training data - it cares whether the training distribution is similar to the distribution of the deployment environment, that is, the languages and accents where you want to use the model. A sketch of what that training could look like follows below.
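A minimal sketch of that idea in Python, assuming short per-keyword clips with all eleven languages mixed under each keyword label; the data/&lt;keyword&gt;/*.wav layout, the MFCC features and the scikit-learn classifier are my assumptions, not part of this answer:

    # Train one keyword classifier on clips from all languages at once.
    import glob, os
    import librosa
    import numpy as np
    from sklearn.svm import SVC

    def mfcc_features(path, sr=16000, n_mfcc=13):
        # Fixed-length summary of a clip: mean and std of each MFCC band.
        y, _ = librosa.load(path, sr=sr, mono=True)
        m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([m.mean(axis=1), m.std(axis=1)])

    # Assumed layout: data/<keyword>/<clip>.wav, with every language and
    # accent mixed together under the same keyword label.
    X, labels = [], []
    for path in glob.glob("data/*/*.wav"):
        X.append(mfcc_features(path))
        labels.append(os.path.basename(os.path.dirname(path)))

    clf = SVC(probability=True).fit(np.array(X), labels)

The same idea scales up to a neural model; the point is only that mixing all languages and accents into one training set is what makes the model accent-robust.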

At a fundamental level, keyword spotting still requires a form of speech recognition - the model has to "recognise" the word to predict whether it is a hot word or not. And the word has to be in the training data to be recognised.

Kathy Reid

I faced a similar issue with accents; my approach was to use spectrograms. I trained a tf-mobilenet classification model on spectrograms (recorded audio files converted to spectrogram images). For testing, the user's voice command is recorded, pre-processed (this step can be skipped if you use a highly directional microphone), and converted to a spectrogram; that image is given to the model as input.
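A rough sketch of that pipeline, assuming Keras's MobileNetV2 (this answer doesn't say which MobileNet variant) and a mel spectrogram as the image; the sample rate, image size and number of keyword classes are placeholders:

    # Convert audio to a spectrogram "image" and classify it with MobileNet.
    import numpy as np
    import librosa
    import tensorflow as tf

    def to_spectrogram_image(path, sr=16000, size=224):
        # Mel spectrogram in dB, rescaled to [-1, 1] (MobileNetV2's expected
        # input range) and stacked to 3 channels to look like an RGB image.
        y, _ = librosa.load(path, sr=sr, mono=True)
        s = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr),
                                ref=np.max)
        s = (s - s.min()) / (s.max() - s.min() + 1e-8)   # scale to 0..1
        img = tf.image.resize(s[..., np.newaxis], (size, size)) * 2.0 - 1.0
        return tf.image.grayscale_to_rgb(img)

    num_keywords = 10  # placeholder: one class per keyword
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False,
        weights="imagenet", pooling="avg")
    base.trainable = False  # transfer learning: reuse pretrained features
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(num_keywords, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At inference time you'd call model.predict(tf.expand_dims(to_spectrogram_image("command.wav"), 0)) on each recorded command.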

I used Python for both the conversion and the training. For speech-to-text, try the vosk or silero-stt models; they give more accurate results than deepspeech and pocket-sphinx. Snowboy is also a good option.
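If you do try vosk, note that the recognizer can be restricted to a small grammar (your keyword list plus "[unk]") instead of doing open-vocabulary speech to text, which fits keyword spotting well. A minimal sketch; the model directory, WAV file and keyword list are placeholders:

    # Spot keywords with vosk by constraining the recognizer's grammar.
    import json
    import wave
    from vosk import KaldiRecognizer, Model

    KEYWORDS = ["alpha", "bravo", "charlie"]    # placeholder keyword list
    model = Model("model")                      # path to a vosk model dir

    wf = wave.open("long_recording.wav", "rb")  # assumed 16 kHz mono PCM WAV
    # The third argument restricts recognition to these phrases plus "[unk]".
    rec = KaldiRecognizer(model, wf.getframerate(),
                          json.dumps(KEYWORDS + ["[unk]"]))

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            text = json.loads(rec.Result()).get("text", "")
            for kw in KEYWORDS:
                if kw in text:
                    print("spotted:", kw)

This still uses a language model under the hood, so accent coverage depends on which vosk model you load.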