I'm trying to write a small script that generates visualizations of audio files in Python. My ultimate goal is a 30 fps video assembled from images that I generate in Python out of some image assets. But I'm a bit stuck on the sound analysis part, because I know almost nothing about sound and the physics and math behind it. To handle importing sound in Python I used librosa, which seems very powerful.
import librosa
import numpy as np
#load an example audio
filename = librosa.ex('trumpet')
y, sr = librosa.load(filename, sr=44100)
#apply short-time Fourier transform
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=None, win_length=None, window='hann', center=True, dtype=None, pad_mode='reflect'))
# converting the matrix to decibel matrix
D = librosa.amplitude_to_db(stft, ref=np.max)
This way I obtain a matrix D of shape (513,480), meaning 513 "steps" in the frequency range and 480 data points, or frames. Given that the duration of the sample is about 5.3 seconds, this makes it about 172.3 frames per second.
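For reference, the 172.3 figure matches the sample rate divided by the STFT hop length. Below is a minimal sketch of that arithmetic, assuming librosa's documented defaults (hop_length falls back to win_length // 4, and win_length falls back to n_fft). The helper name stft_frame_rate is hypothetical and not part of librosa.

```python
def stft_frame_rate(sr, n_fft, hop_length=None):
    """Nominal number of STFT frames per second of audio."""
    if hop_length is None:
        # Assumed default: librosa.stft uses win_length // 4, and
        # win_length itself defaults to n_fft when it is not given.
        hop_length = n_fft // 4
    return sr / hop_length


# Same settings as the snippet above: sr=44100, n_fft=1024, default hop.
rate = stft_frame_rate(sr=44100, n_fft=1024)
print(round(rate, 1))  # 172.3
```

Under this reading, the frame rate depends only on the ratio of sample rate to hop length, not on the length of the file.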
Because I want to have 30 fps, I did some trial and error on the sample rate until I got to 23038. Applying this when loading the file:
y, sr = librosa.load(filename, sr=23038)
I obtain a frame rate of 89.998 fps, which seems more usable for my purposes. So I went on and reduced the resolution of my data by averaging the frequency readings into 9 "buckets":
#Initialize new empty array of target size
nD = np.empty(shape=(9,D.shape[1]))
#populate new array
for i in range(D.shape[1]):
    nD[:,i] = D[:,i].reshape(-1, 57).mean(axis=1)
The 9 comes from the fact that 513 = 9 × 57, so groups of 57 bins divide the 513 frequency bins evenly. I then aggregated 3 frames together to get to my 30 fps:
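# Output array (allocation assumed, it was not shown originally):
# one column per complete group of 3 frames
nnD = np.empty(shape=(nD.shape[0], nD.shape[1] // 3))
count = 0
for i in range(0, nD.shape[1], 3):
    try:
        # average 3 consecutive frames into one output column
        nnD[:, count] = nD[:, i:i+3].reshape(-1, 3).mean(axis=1)
        count += 1
    except Exception:
        # an incomplete trailing group of frames ends up here and is skipped
        pass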
This seems totally hacky: the try/except is there because I sometimes get index errors, and incrementing a counter by hand seems clumsy. But, incredibly enough, it seems to work somehow. For testing purposes I made a visualization routine:
from time import sleep
from os import system
nnD = ((nnD+80)/10).astype(int)
for i in range(nnD.shape[1]):
    for j in range(nnD.shape[0]):
        print ("#"*nnD[j,i])
    sleep(1/30)
    system('clear')
That shows rows made of # in the terminal.
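For comparison, the two averaging steps described earlier (57 frequency bins per bucket, then 3 frames per group) can be sketched without explicit loops or a try/except. This is only a sketch under assumptions: average_blocks is a hypothetical helper, the input is a random stand-in for the decibel matrix rather than real audio, and trailing frames that do not fill a complete group are simply dropped.

```python
import numpy as np


def average_blocks(D, bins_per_bucket=57, frames_per_group=3):
    """Average a (n_bins, n_frames) matrix over blocks of bins and of frames."""
    n_bins, n_frames = D.shape
    if n_bins % bins_per_bucket != 0:
        raise ValueError("n_bins must be a multiple of bins_per_bucket")
    # Drop trailing frames that do not fill a complete group.
    usable = (n_frames // frames_per_group) * frames_per_group
    D = D[:, :usable]
    # Split each axis into (block index, position inside block), then
    # average over the two "position inside block" axes.
    blocks = D.reshape(n_bins // bins_per_bucket, bins_per_bucket,
                       usable // frames_per_group, frames_per_group)
    return blocks.mean(axis=(1, 3))


# Synthetic stand-in for the decibel matrix (values in [-80, 0)),
# with one extra frame so the trailing-group handling is exercised.
rng = np.random.default_rng(0)
D_demo = rng.uniform(-80.0, 0.0, size=(513, 481))
reduced = average_blocks(D_demo)
print(reduced.shape)  # (9, 160)
```

Because every block has the same size, averaging both axes at once should give the same numbers as averaging the bins first and the frames second.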
Now, my question is:
How can I do this the proper way?
More specifically:
1. Is there a way to match the frame rate of the Fourier data without hacking the sample rate?
2. Is there a more proper way to aggregate my data, ideally into an arbitrary number of buckets instead of having to choose a factor of 513?