
I am currently working on training a classifier with PyTorch and torchaudio. For this I followed this tutorial: https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5

This all works like a charm and my classifier is now able to successfully classify .wav files. However, I would like to turn this into a real-time classifier that can also classify recordings from a microphone/loopback input.

For this, I would like to avoid saving the recording to a .wav file just to load it again, and instead feed the classifier an in-memory recording directly.

The tutorial uses the .load function of torchaudio to load a .wav file and return a waveform and sample rate as follows:

sig, sr = torchaudio.load(audio_file)

Loopback support is pretty much required, and since pyaudio apparently does not support loopback devices yet (except for a fork that is very likely outdated), I stumbled across soundcard: https://soundcard.readthedocs.io/en/latest/

I found that this code yields a recording of my speaker loopback:

import time

import soundcard as sc

# get a list of all speakers:
speakers = sc.all_speakers()
# get the current default speaker on your system:
default_speaker = sc.default_speaker()

# get a list of all microphones, including loopback devices:
mics = sc.all_microphones(include_loopback=True)

# the first entry is the loopback of the default speaker:
default_mic = mics[0]

with default_mic.recorder(samplerate=148000) as mic, \
        default_speaker.player(samplerate=148000) as sp:
    print("Recording...")
    data = mic.record(numframes=1000000)
    print("Done...Stop your sound so you can hear playback")
    time.sleep(5)
    sp.play(data)

However, now of course I don't want to play that audio with the .play function but instead pass it on to torchaudio/the classifier. Since I am new to the world of audio processing, I have no idea how to get this data into a format similar to the one returned by torchaudio. According to the soundcard docs, the data has the following format:

The data will be returned as a frames × channels float32 numpy array

As a last resort, maybe saving it to an in-memory .wav file and then reading it with torchaudio would be possible? Any help is appreciated. Thanks in advance!

Jalau

1 Answer


According to the docs, you will get a numpy array of shape frames × channels: for a stereo microphone this will be (N, 2), for a mono microphone (N, 1).

This is pretty much what torchaudio.load outputs: sig is the raw signal, and sr the sampling rate. You specified the sample rate yourself when opening the recorder (so sr = 148000), and you just need to convert the raw numpy signal to a torch tensor with:

sig_mic = torch.tensor(data)

Just check that the dimensions match: torchaudio.load() returns a (channels, frames) tensor, while soundcard records (frames, channels), so transpose the tensor rather than reshaping it (a reshape would interleave the channels' samples):

sig_mic = torch.tensor(data).T
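Putting it together, a minimal sketch of the whole conversion, with a dummy random array standing in for `mic.record(...)`; note that a transpose (not a reshape) is what preserves the channel layout:

```python
import numpy as np
import torch

# Stand-in for mic.record(...): a (frames, channels) float32 numpy array.
data = np.random.rand(148000, 2).astype("float32")

# torchaudio.load returns (channels, frames), so transpose; a
# reshape((2, -1)) would interleave the two channels' samples.
sig_mic = torch.from_numpy(data).T.contiguous()
print(sig_mic.shape, sig_mic.dtype)  # torch.Size([2, 148000]) torch.float32
```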
PlainRavioli
  • Hey there! Thank you, that was very helpful. I can't say whether it is the expected format / fully working, but it ran without errors. I had to reshape as you already stated. Furthermore, I ran into an error complaining about the tensor not being a scalar double, so I had to call sig_mic = sig_mic.float() to convert it from float64 to the expected type. – Jalau Sep 20 '22 at 14:07
  • 1
    Glad it helped! Don't forget to mark the topic as solved :) – PlainRavioli Sep 20 '22 at 14:09
  • Yes, I was checking it out, and maybe I am just stupid, but sadly it seems to not work as expected: if I feed it a flac extract it predicts correctly, but if I play the same flac extract and record it via the loopback it predicts the wrong result. Do you have any clue what the issue could possibly be? Could it be that the preprocessing is somehow different for the mic? My current code: https://hastebin.com/aponodanom.py – Jalau Sep 20 '22 at 14:28
  • 1
    This is a quite different issue from this one. You should create a dedicated topic, but my guess is that your model was only trained on clean recordings and doesn't generalize well to mic recordings with noise/artefacts? – PlainRavioli Sep 20 '22 at 14:32
  • Okay, I just saved the file and listened to it, and it sounds horrible. Like slowed down and further messed up or something. But using .play from soundcard it still sounds fine. So it must be happening in the conversion to a torchaudio tensor. https://hastebin.com/kiwecudafe.py This is the code I am using for that. – Jalau Sep 20 '22 at 14:33
  • 1
    Did you use the same `sr` as when recording? – PlainRavioli Sep 20 '22 at 14:35
  • Yes I did. The debug output also shows that (I changed the sample rate to 48000 because that is what my PC says the output runs at). So recording uses 48000 and the saving as well. Log prints: Recording with sr 48000 for 4 seconds... Done...Stop your sound so you can hear playback Model loaded Conversion samplerate is 48000 Saved samplerate is: 48000 Edit: Saving a wav again after using the load function works perfectly fine. So it's probably something happening / missing during conversion. One can guess the original sound, but it is definitely slowed down and maybe further altered. – Jalau Sep 20 '22 at 14:38
  • Ah, maybe the float precision is a lot worse than in the original recording. What is the dtype inside the original numpy recording? – PlainRavioli Sep 20 '22 at 14:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248202/discussion-between-jalau-and-plainravioli). – Jalau Sep 20 '22 at 14:46