Converting Python Speech Recognition Audio Frame data to a numpy array that can be processed by Whisper?

Question

I'm using the SpeechRecognition Python library to record audio bytes from my microphone in mono at 16 kHz, and I want to pass them to the new Whisper library, which accepts NumPy arrays, spectrograms, and file paths. Writing to a file takes too long, so I'd like to convert the data directly to an array and pass that to Whisper.

Datagniel · Answer 1 · 2022-10-28T16:27:51.733

Here is a solution to your problem:

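Assuming your code looks like this:

import numpy as np
import speech_recognition as sr

# device_index selects the microphone; 16 kHz is the rate Whisper works with
with sr.Microphone(device_index=device_index, sample_rate=16000) as source:
    r = sr.Recognizer()
    audio = r.listen(source, timeout=None)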

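you need to extract the raw 16-bit PCM bytes from the audio data (the output of Recognizer.listen), using get_raw_data() rather than get_wav_data() so that the WAV header does not end up in the array as bogus samples (step 1)

audio_data = audio.get_raw_data(convert_rate=16000, convert_width=2)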

which can be converted to an array of int16 (step 2)

data_s16 = np.frombuffer(audio_data, dtype=np.int16, count=len(audio_data)//2, offset=0)

which can then be converted to an array of float32 (step 3)

float_data = data_s16.astype(np.float32, order='C') / 32768.0

which can then be processed by Whisper, since the result is a mono float32 array at 16 kHz with values scaled to [-1, 1). If there is a faster way (maybe a combination of steps 2 and 3), let me know.
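The int16 and float32 steps can be folded into one small helper. The following is a minimal sketch, not a definitive implementation: the function name pcm16_to_float32 is hypothetical, it assumes headerless little-endian 16-bit mono PCM as input, and it uses synthetic bytes in place of a real microphone recording.

```python
import numpy as np


def pcm16_to_float32(raw_pcm: bytes) -> np.ndarray:
    """Convert little-endian 16-bit mono PCM bytes to float32 in [-1.0, 1.0)."""
    # View the bytes as signed 16-bit samples; frombuffer does not copy.
    samples = np.frombuffer(raw_pcm, dtype="<i2")
    # Cast to float32 and divide by 2**15 to scale into [-1.0, 1.0).
    return samples.astype(np.float32) / 32768.0


# Synthetic stand-in for microphone data: four known sample values.
raw = np.array([0, 16384, -32768, 32767], dtype="<i2").tobytes()
float_data = pcm16_to_float32(raw)
```

With a real recording, the input would be audio.get_raw_data(convert_rate=16000, convert_width=2), and the returned array could then be handed to Whisper's transcribe call. The cast still allocates one new array, so this is tidier rather than necessarily faster.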

Greetings

score 0 · Answer 2 · answered Oct 25 '22 at 13:09


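Try the librosa library. Its load function returns the audio as a float32 NumPy array together with the sample rate; pass sr=16000 to resample to the rate Whisper works with:

librosa.load(path, *, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='soxr_hq')

See the librosa.load documentation for the full parameter list.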
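Since the question is about avoiding a round trip through disk, here is a hedged sketch of decoding WAV bytes that are already in memory (for example the output of audio.get_wav_data()). It uses Python's standard wave module rather than librosa; the helper name wav_bytes_to_float32 is hypothetical, and it assumes mono 16-bit PCM.

```python
import io
import wave

import numpy as np


def wav_bytes_to_float32(wav_bytes: bytes):
    """Decode in-memory 16-bit PCM WAV bytes; return (float32 samples, sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wf.getframerate()
        # readframes returns only the sample data, so the header is skipped.
        frames = wf.readframes(wf.getnframes())
    # Mono is assumed; multi-channel data would come back interleaved.
    samples = np.frombuffer(frames, dtype="<i2")
    return samples.astype(np.float32) / 32768.0, sample_rate


# Build a tiny WAV in memory as a stand-in for a real recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(np.array([0, 16384, -16384], dtype="<i2").tobytes())

data, rate = wav_bytes_to_float32(buf.getvalue())
```

Parsing the container this way means the header is never mistaken for audio samples, and the sample rate can be checked before the array is passed on.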


Mohamed Salama
