
I am running librosa.pyin on a speech audio clip, and it doesn't seem to extract the fundamental frequency (f0) from the first part of the recording.

librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html

import librosa

# Load the clip (linked below) at sr = 22050
y, sr = librosa.load('quick_fox.wav', sr=22050)

fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')

f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             sr=sr,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100)

Raw audio:

[waveform plot of the raw audio]

Spectrogram with fundamental tones, onsets, and onset strength; note that the first part doesn't have any fundamental tones extracted.

link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav

o_env = librosa.onset.onset_strength(y=y, sr=sr)
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)

[spectrogram with f0, onsets, and onset strength]

Another view with power spectrogram:

[power spectrogram with f0 overlay]

I tried compressing the audio, but that didn't seem to work.

Any suggestions on what parameters I can adjust, or audio pre-processing that can be done to have fundamental tones extracted from all words?

What type of things affect fundamental tone extraction success?

jmhead

1 Answer


TL;DR It seems it's all about parameter tweaking.

Here are some results I've got playing with the example (better to open the image in a separate tab):

[comparison graphs for different parameter settings]

The bottom plot shows a phonetic transcription (well, kinda) of the example file. Some conclusions I've made for myself:

  1. There are some words/parts of words that are difficult to hear: they have low energy, and listened to in isolation they don't sound like words, only when coupled with nearby segments ("the" is very short and sounds more like "z").
  2. Some words are divided into parts (e.g. "fo"-"x").
  3. I don't really know what the F0 frequency should be when someone pronounces "x". I'm not even sure the pronunciation differs between people (otherwise how do cats know we are calling them all over the world).
  4. A two-second recording is a pretty short amount of time.

Some experiments:

  • If we want a smooth F0 curve, n_thresholds=1 will do the trick. It's a bad idea, though: in the voiced_flag part of the graphs we see that with n_thresholds=1 every frame is declared voiced, so every frequency change counts as voice activity.
  • Changing the sample rate affects the ability to retrieve F0 (in the rightmost graph the sample rate was halved). As mentioned above, n_thresholds=1 doesn't count, but we also see that n_thresholds=100 (the default for pyin) doesn't produce any F0 at all.
  • The top-left (max_transition_rate=200) and middle (max_transition_rate=100) graphs show the extracted F0 for n_thresholds=2 and n_thresholds=100. It actually degrades pretty fast: n_thresholds=3 already looks almost the same as n_thresholds=100. I find the lower part, the voiced_flag decision plot, most informative when combined with the phonetic transcript. In the middle graph, the default parameters recognise "qui", "jum", "over", "la". If we want F0 for the other phonemes, n_thresholds=2 should do the work.
  • Setting n_thresholds=3 or higher gives F0s in the same range. Increasing max_transition_rate adds noise and a reluctance to declare that a voiced segment is over.

Those are my thoughts. Hope it helps.

griko
  • Thanks! Is there any audio pre-processing that you think would help for f0 extraction on speech audio? – jmhead Dec 20 '22 at 12:36
  • It depends on the task. Maybe the extracted F0s are already enough. From what I've seen, when talking about F0 for a speaker recognition task, it is common to take only high-energy segments (which can be done with VAD, voice activity detection; some popular libraries are sileroVAD and pyannoteVAD) and then take the mean of the F0s as a feature. Again, 2 seconds is a short audio sample by itself; applying VAD to it may be too much. Also, it will not add F0s but will shorten the file, which will then be filled with F0s. – griko Dec 20 '22 at 13:32
  • From the plots, it looks like for most relevant parameters the mean F0 will be about the same value. And if you are looking to feed them to an LSTM, they are also enough to be considered as temporal input. – griko Dec 20 '22 at 13:34