There are quite a few separate issues involved in your question "Is my approach correct or is there a better way to do this?". The most prominent are:
1. Reading two different audio files and concatenating them
2. Mixing multiple audio files into one audio file
3. Using audio as input to a neural network (NN), i.e. what form the input data should take
4. The type of NN to use for the audio-related task
5. The actual loss/task that the NN will be trained on
6. How to verify that one approach is better than another
I think you are implicitly asking about points 1 and 2, so I will focus my answer on those.
What you are showing could be a minimal working example only if:
- `marvin_audio.wav` and `speak_audio.wav` have the same sampling frequency, and
- `+` means concatenation, which is quite non-intuitive for audio processing.

If either of the above does not hold, you will end up with distorted audio. If both hold, the result will simply be the audio of the first file followed by the audio of the second.
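For example, here is a minimal sketch of doing this explicitly (the 16 kHz target rate and the use of soundfile for writing are my assumptions, not something taken from your snippet):

```python
import numpy as np
import librosa
import soundfile as sf

target_sr = 16000  # assumed target sampling rate; use whatever your pipeline expects

# librosa.load resamples on load, so both signals end up at the same rate.
marvin, _ = librosa.load("marvin_audio.wav", sr=target_sr)
speak, _ = librosa.load("speak_audio.wav", sr=target_sr)

# Explicit concatenation: the first clip followed by the second.
combined = np.concatenate([marvin, speak])

sf.write("combined.wav", combined, target_sr)
```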
There are a few things you can do that do not require expert field knowledge (a sketch implementing them follows this list):
- From your audio files, trim the silence at the beginning and the end (silence = consecutive samples whose maximum value is below a threshold, e.g. -60 dBFS)
- Normalize the audio files so that both have a maximum absolute value of 1
- Add a fade-in and a fade-out at the beginning and the end (respectively) of your silence-trimmed audio files
- Manually create a silence audio file (i.e. an audio file whose samples are all zeros) with a duration chosen by you, such that most combinations/concatenations of your audio files sound close to natural.
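A rough sketch of these four steps with librosa and numpy; the -60 dBFS threshold, the 10 ms fades, and the 0.5 s of inserted silence are example values, not anything prescribed:

```python
import numpy as np
import librosa
import soundfile as sf

def clean_clip(y, sr, fade_ms=10):
    # 1) Trim leading/trailing silence: drop edge frames below -60 dBFS
    #    (ref=1.0 makes top_db relative to full scale rather than the signal peak).
    y, _ = librosa.effects.trim(y, top_db=60, ref=1.0)
    # 2) Normalize so the maximum absolute value is 1.
    y = librosa.util.normalize(y)
    # 3) Short linear fade-in and fade-out to avoid clicks at the clip edges.
    fade_len = int(sr * fade_ms / 1000)
    fade = np.linspace(0.0, 1.0, fade_len, dtype=y.dtype)
    y[:fade_len] *= fade
    y[-fade_len:] *= fade[::-1]
    return y

sr = 16000  # assumed common sampling rate
marvin, _ = librosa.load("marvin_audio.wav", sr=sr)
speak, _ = librosa.load("speak_audio.wav", sr=sr)

# 4) "Silence file": half a second of zeros placed between the two clips.
silence = np.zeros(int(0.5 * sr), dtype=marvin.dtype)

combined = np.concatenate([clean_clip(marvin, sr), silence, clean_clip(speak, sr)])
sf.write("combined_clean.wav", combined, sr)
```

The trim threshold, fade length, and silence duration are free parameters; listen to a few concatenations and adjust them.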
To have more control over what you are doing, I would recommend using a dedicated audio processing library, like librosa.
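For instance (again only a sketch), librosa makes the sampling-rate issue explicit: loading with `sr=None` keeps each file's native rate, so you can check for a mismatch and resample before concatenating:

```python
import librosa

# sr=None keeps each file's native sampling rate so you can actually compare them.
y1, sr1 = librosa.load("marvin_audio.wav", sr=None)
y2, sr2 = librosa.load("speak_audio.wav", sr=None)

if sr1 != sr2:
    # Bring the second signal to the first one's rate before concatenating.
    y2 = librosa.resample(y2, orig_sr=sr2, target_sr=sr1)
```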