
I'm currently implementing Vosk speech recognition in an application. Looking specifically at speaker recognition, I've implemented test_speaker.py from the examples and it is functional. Being new to this, how can I identify and/or create the reference speaker signature? Using the one provided, the list of distances calculated from my audio example doesn't distinguish the two speakers involved:

[1.0182311997728735, 0.8679279016022726, 0.8552687907177629, 1.0258941854519696, 0.8666933753723253, 0.9291881495586336, 1.0316585805917928, 1.0227699471036409, 0.8442800102809634, 0.9093189414477789, 0.9153723223264221, 0.9705387223260904, 0.9077720598812595, 0.9524431272217568, 0.9179475137290445]

If there is no effective way to calculate a reference speaker from within the audio under analysis, do you know of another solution that can be used with Vosk to identify speakers in an audio file? If not, what other speech-to-text option would you suggest? (I've already played with Google's.)

Thanks in advance

rafadevi

1 Answer


I've been working with Vosk recently as well, and the way to create a new reference speaker is to extract the x-vector output from the recognizer.

This is code from the Python example that I adapted to put each utterance's x-vector into a list called vectorList.

    # Inside the audio read loop; recognizer is a KaldiRecognizer
    # configured with a speaker model via SetSpkModel()
    if recognizer.AcceptWaveform(data):
        res = json.loads(recognizer.Result())
        # print("Text:", res['text'])
        # The x-vector ('spk') is only present in the result when
        # the speaker model produced one for this utterance
        if 'spk' in res:
            # Append the x-vector to the baseline list
            vectorList.append(res['spk'])

In my program, I then use the vectors in vectorList as the reference speakers that are compared against other x-vectors with the cosine_dist function, which returns a "speaker distance" telling you how different two x-vectors are.
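The Vosk example implements cosine_dist with NumPy; the same computation can be sketched without any dependencies. The distance is 1 minus cosine similarity, so identical directions give roughly 0.0 and unrelated vectors give values near 1.0:

```python
import math

def cosine_dist(x, y):
    # Speaker distance as computed in the Vosk example:
    # 1 - cosine similarity of the two x-vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

With this definition, comparing an x-vector against itself returns 0.0, which is why the distances around 0.85 to 1.0 in the question suggest the provided reference signature matches neither speaker in the audio.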

In summary, the program I'm developing does the following:

  • Runs some "baseline" audio files through the recognizer to get their x-vectors
  • Stores those x-vectors in a list
  • Runs some test audio files through the recognizer to get x-vectors to test with
  • Runs each test x-vector against each "baseline" x-vector with the cosine_dist function
  • Averages the speaker distances returned from cosine_dist to get the average speaker distance

I should mention that I'm no expert with Vosk, and it is entirely possible there is a better way to go about this. This is just the approach I've found, based on the example in the Python directory.

  • Is this working well for you? We have tried this approach but the speaker matches it returns are not accurate enough for us. Is there something else we can try to improve the accuracy? – John Pollard Mar 08 '22 at 21:37
  • We eventually moved away from using Vosk altogether for speaker recognition. We found it rather inaccurate and it couldn't be relied on. It may be a good idea to have multiple "baseline" vectors to compare against; however, we decided not to pursue it any further. We've found TensorFlow and Keras highly promising, though. Here's an article that demonstrates speaker classification well: https://towardsdatascience.com/voice-classification-with-neural-networks-ff90f94358ec – Aaron Walker Mar 09 '22 at 22:07