
I am working with an audio file using webrtcvad and pydub. At the moment, fragments are split at the silences between sentences. Is there any way the split can be done at word-level boundaries instead (after each spoken word)? If librosa/ffmpeg/pydub has a feature like this, is a split possible at each vocal? After the split, I also need the exact start and end time of each vocal as it is positioned in the original file. One simple way to split with ffmpeg is described here:

https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e

but this also splits by silence, and the result changes with each padding number or frame size. I am trying to split by vocal. As an example, I have done this manually; the original file, the split words, and their time positions in JSON are in a folder provided under this link:

www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz

ML85
  • Is your input signal speech (not singing)? Can you provide an example audio file with indications of where you'd like the split points? – Jon Nordby Oct 01 '20 at 21:44
  • I can provide any example as audio. I am interested in vocals, not specifically background music, so there may even be slight music. Say the audio has two sentences: "Peter is an engineer. He works eight hours daily." Normally the split happens after a sentence, at a break of silence. But I want to split like "Peter", "is", "an", "engineer", "He", "works", "eight", "hours", "daily" (ignore the commas, this is just to explain). Each split and exported WAV should also carry the time that vocal has in the original file. – ML85 Oct 02 '20 at 07:52
  • Ok then this is a Speech Segmentation task, at word boundaries – Jon Nordby Oct 02 '20 at 08:31
  • Is there any possibility to do this with a library? Could you guide me on how to do it? – ML85 Oct 02 '20 at 08:46
  • If you can provide an example file with marks of expected boundaries, then maybe I (or someone else) can give some example code – Jon Nordby Oct 03 '20 at 19:04
  • OK, I added the example data with marks, plus info as text and JSON. – ML85 Oct 14 '20 at 15:11
  • Hi ML85. Not offhand, no. This would be better asked as a new question – Jon Nordby Dec 29 '20 at 15:34

2 Answers


Simple audio segmentation problems can be handled by using a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be sound level and vocal activity / voicedness. To get word-level segmentation (as opposed to sentence-level), this needs rather high time resolution. Unfortunately, pyWebRTCVAD does not have adjustable time smoothing, so it may not be suited for the task.
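
For illustration, here is a minimal sketch of that idea: a two-state (silence/speech) HMM over frame-level sound levels. This assumes the hmmlearn package; the feature choice and frame size are illustrative, and sound level alone is used only for simplicity (as noted below, it is not sufficient for this particular recording):

import numpy
import librosa
from hmmlearn import hmm  # assumed dependency: pip install hmmlearn

# load audio and compute a per-frame sound level (log-RMS, 10 ms frames)
audio, sr = librosa.load('attached_problem/main.wav', sr=16000)
rms = librosa.feature.rms(y=audio, frame_length=160, hop_length=160)[0]
features = numpy.log(rms + 1e-9).reshape(-1, 1)

# two hidden states (silence and speech); the HMM transition
# probabilities provide the time smoothing that pyWebRTCVAD lacks
model = hmm.GaussianHMM(n_components=2, covariance_type='diag', n_iter=50)
model.fit(features)
states = model.predict(features)  # one 0/1 label per 10 ms frame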

In your audio sample there is a radio host speaking rather quickly in German. Looking at the sound levels with respect to the word boundaries you have marked, it is clear that between some words the sound level does not really drop. That rules out a simple sound-level segmentation model.

All in all, getting good results for general speech signals can be quite hard. But fortunately this is very well researched, and off-the-shelf solutions are available. These typically use an acoustic model (how words and phonemes sound) as well as a language model (likely orderings of words), learned over many hours of audio.

Word segmentation using a speech recognition library

All these features are included in a speech recognition framework, and many frameworks can provide word-level outputs with timing. Below is some working code for this using Vosk.

Alternatives to Vosk would be PocketSphinx, or an online speech recognition service from Google Cloud, Amazon Web Services, Azure, etc.


import sys
import os
import json
import math

# tested with VOSK 0.3.15
import vosk
import librosa
import numpy
import pandas



def extract_words(res):
    jres = json.loads(res)
    if 'result' not in jres:
        return []
    words = jres['result']
    return words

def transcribe_words(recognizer, audio_bytes):
    results = []

    # feed the audio to VOSK in small chunks; whenever a complete
    # utterance has been decoded, collect its words with timings
    chunk_size = 4000
    for chunk_no in range(math.ceil(len(audio_bytes)/chunk_size)):
        start = chunk_no*chunk_size
        end = min(len(audio_bytes), (chunk_no+1)*chunk_size)
        data = audio_bytes[start:end]

        if recognizer.AcceptWaveform(data):
            words = extract_words(recognizer.Result())
            results += words
    results += extract_words(recognizer.FinalResult())

    return results

def main():

    vosk.SetLogLevel(-1)

    audio_path = sys.argv[1]
    out_path = sys.argv[2]

    model_path = 'vosk-model-small-de-0.15'
    sample_rate = 16000

    audio, sr = librosa.load(audio_path, sr=sample_rate)

    # convert to 16bit signed PCM, as expected by VOSK
    int16 = numpy.int16(audio * 32768).tobytes()

    # XXX: Model must be downloaded from https://alphacephei.com/vosk/models
    # https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
    if not os.path.exists(model_path):
        raise ValueError(f"Could not find VOSK model at {model_path}")

    model = vosk.Model(model_path)
    recognizer = vosk.KaldiRecognizer(model, sample_rate)

    res = transcribe_words(recognizer, int16)
    df = pandas.DataFrame.from_records(res)
    df = df.sort_values('start')

    df.to_csv(out_path, index=False)
    print('Word segments saved to', out_path)

if __name__ == '__main__':
    main()

Run the program with the path to the .WAV file and the path to an output file.

python vosk_words.py attached_problem/main.wav out.csv

The script outputs the words and their times to the CSV. These timings can then be used to split the audio file; a sketch of this is given at the end of this answer. Here is example output:

conf,end,start,word
0.618949,1.11,0.84,also
1.0,1.32,1.116314,eine
1.0,1.59,1.32,woche
0.411941,1.77,1.59,des

Comparing the output (bottom) with the example file you provided (top), it looks pretty good.

[Image: manually annotated word boundaries from the provided example (top) compared with the Vosk output (bottom)]

It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.
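
To actually split the audio using these timings, the CSV can be fed to pydub. A minimal sketch (assuming pydub is installed; file names are illustrative):

import pandas
from pydub import AudioSegment

audio = AudioSegment.from_wav('attached_problem/main.wav')
words = pandas.read_csv('out.csv')

for i, row in words.iterrows():
    # pydub slices in milliseconds; start/end keep the word's
    # position in the original file, as requested
    clip = audio[int(row['start'] * 1000):int(row['end'] * 1000)]
    clip.export(f"{i:03d}_{row['word']}.wav", format='wav')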

Jon Nordby
  • Dear Jonnor, you have really taught me an excellent technique; I have no words to express my thanks. The biggest lesson I learned in 2020 before year's end was learning about VOSK. I would like to ask two things. As this is Apache 2.0, using it should not be a problem? And can I train my own model with Vosk? I have seen some text saying one can customize and train the engine with one's own data. – ML85 Dec 21 '20 at 10:41
  • VOSK seems to be Apache 2.0, yes. But it uses a bunch of other projects, so you may want to review those licenses as well. – Jon Nordby Dec 21 '20 at 14:43
  • As for training your own model with VOSK, I really do not know how to do that. That question is better directed to them – Jon Nordby Dec 21 '20 at 14:44
  • Indeed, this is great learning for me from you. Thank you so much! – ML85 Dec 21 '20 at 16:34
  • Damn, why haven't I heard about VOSK before?! Evil Google :) – pouya Sep 29 '21 at 17:28

Delimiting words is outside the audio domain and requires a kind of intelligence. Doing it manually is easy because we are intelligent and know exactly what we are looking for, but automating the process is hard because, as you already noticed, a silence is not (not only, not always) a word delimiter.

At the audio level, we can only approximate a solution, and this requires both analyzing the amplitude of the signal and adding some time mechanisms. As an example, Pro Tools provides a nice tool named Strip Silence that cuts audio regions automatically based on the amplitude of the signal. It always keeps the material at its original position in the timeline, and naturally each region knows its own duration. In addition to the threshold in dB, and to prevent creating too many regions, it provides several useful parameters in the time domain: a minimum length for the created regions, a delay before the cut (computed from the point where the amplitude drops below the threshold), and an inverted delay before reopening the gate (computed backward from the point where the amplitude rises above the threshold). A sketch of this approach follows.
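
Below is a minimal sketch of such a gate in Python. This is not the Pro Tools implementation: the pre/post delays are omitted for brevity, and the threshold and minimum region length are illustrative values that must be tuned to the speaker:

import numpy
import librosa

# frame-level sound level in dB, 10 ms frames
audio, sr = librosa.load('attached_problem/main.wav', sr=16000)
hop = 160
rms = librosa.feature.rms(y=audio, frame_length=hop, hop_length=hop)[0]
level_db = librosa.amplitude_to_db(rms, ref=numpy.max)

threshold_db = -35.0   # gate threshold (illustrative)
min_region_s = 0.05    # minimum length for a created region

regions, start = [], None
for i, above in enumerate(level_db > threshold_db):
    if above and start is None:
        start = i
    elif not above and start is not None:
        t0, t1 = start * hop / sr, i * hop / sr
        if t1 - t0 >= min_region_s:
            regions.append((t0, t1))  # positions in the original timeline
        start = None
if start is not None:
    regions.append((start * hop / sr, len(level_db) * hop / sr))

print(regions)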

This could be a good starting point for you. Implementing such a system probably won't be 100% successful, but you could obtain quite a good success rate if the settings are well adjusted to the speaker. Even if it's not perfect, it will significantly reduce the need for manual work.

dspr