There are quite a few separate issues involved in your question "Is my approach correct or is there a better way to do this?". The most prominent are:
1. Reading two different audio files and concatenating them
2. Mixing multiple audio files into one audio file
3. Using audio as input to a neural network (NN), i.e. what form the input data should take
4. The type of NN to use for the audio-related task
5. The actual loss/task that the NN will be trained on
6. How to verify that one approach is better than another
I think you are implicitly asking about points 1 and 2, so I will focus my answer on those.
What you are showing could be a minimal working example only if:
- `marvin_audio.wav` and `speak_audio.wav` have the same sampling frequency, and
- `+` means concatenation, which is quite non-intuitive for audio processing.

If either of the above does not hold, you will end up with distorted audio. If both hold, the result will simply be the audio of the first file followed by the audio of the second.
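For example, here is a minimal sketch of doing this explicitly (the 16 kHz target rate and the use of soundfile for writing are my assumptions, not something taken from your snippet):

```python
import numpy as np
import librosa
import soundfile as sf

target_sr = 16000  # assumed target sampling rate; use whatever your pipeline expects

# librosa.load resamples on load, so both signals end up at the same rate.
marvin, _ = librosa.load("marvin_audio.wav", sr=target_sr)
speak, _ = librosa.load("speak_audio.wav", sr=target_sr)

# Explicit concatenation: the first clip followed by the second.
combined = np.concatenate([marvin, speak])

sf.write("combined.wav", combined, target_sr)
```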
There are a few things you can do that do not require expert field knowledge (a sketch implementing them follows this list):
- From your audio files, trim the silence at the beginning and the end (silence = consecutive samples whose maximum value is below a threshold, e.g. -60 dBFS)
- Normalize the audio files so that both have a maximum absolute value of 1
- Add a fade-in and a fade-out at the beginning and the end (respectively) of your silence-trimmed audio files
- Manually create a silence audio file (i.e. an audio file whose samples are all zeros) with a duration chosen by you, such that most combinations/concatenations of your audio files sound close to natural.
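A rough sketch of these four steps with librosa and numpy; the -60 dBFS threshold, the 10 ms fades, and the 0.5 s of inserted silence are example values, not anything prescribed:

```python
import numpy as np
import librosa
import soundfile as sf

def clean_clip(y, sr, fade_ms=10):
    # 1) Trim leading/trailing silence: drop edge frames below -60 dBFS
    #    (ref=1.0 makes top_db relative to full scale rather than the signal peak).
    y, _ = librosa.effects.trim(y, top_db=60, ref=1.0)
    # 2) Normalize so the maximum absolute value is 1.
    y = librosa.util.normalize(y)
    # 3) Short linear fade-in and fade-out to avoid clicks at the clip edges.
    fade_len = int(sr * fade_ms / 1000)
    fade = np.linspace(0.0, 1.0, fade_len, dtype=y.dtype)
    y[:fade_len] *= fade
    y[-fade_len:] *= fade[::-1]
    return y

sr = 16000  # assumed common sampling rate
marvin, _ = librosa.load("marvin_audio.wav", sr=sr)
speak, _ = librosa.load("speak_audio.wav", sr=sr)

# 4) "Silence file": half a second of zeros placed between the two clips.
silence = np.zeros(int(0.5 * sr), dtype=marvin.dtype)

combined = np.concatenate([clean_clip(marvin, sr), silence, clean_clip(speak, sr)])
sf.write("combined_clean.wav", combined, sr)
```

The trim threshold, fade length, and silence duration are free parameters; listen to a few concatenations and adjust them.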
To have more control over what you are doing, I would recommend using a dedicated audio processing library, like librosa.
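For instance (again only a sketch), librosa makes the sampling-rate issue explicit: loading with `sr=None` keeps each file's native rate, so you can check for a mismatch and resample before concatenating:

```python
import librosa

# sr=None keeps each file's native sampling rate so you can actually compare them.
y1, sr1 = librosa.load("marvin_audio.wav", sr=None)
y2, sr2 = librosa.load("speak_audio.wav", sr=None)

if sr1 != sr2:
    # Bring the second signal to the first one's rate before concatenating.
    y2 = librosa.resample(y2, orig_sr=sr2, target_sr=sr1)
```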