
I'm trying to write a small Python script that generates visualizations of audio files. My ultimate goal is a 30 fps video assembled from images that I generate in Python out of some image assets. But I'm a bit stuck on the sound-analysis part, because I know almost nothing about sound or the physics and math behind it. To import the audio I used librosa, which seems very powerful.

import librosa
import numpy as np

#load an example audio
filename = librosa.ex('trumpet')
y, sr = librosa.load(filename, sr=44100)
#apply short-time Fourier transform
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=None, win_length=None, window='hann', center=True, dtype=None, pad_mode='reflect'))
# converting the matrix to decibel matrix
D = librosa.amplitude_to_db(stft, ref=np.max)

This way I obtain a matrix D of shape (513,480), meaning 513 "steps" in the frequency range and 480 data points, or frames. Given that the duration of the sample is about 5.3 seconds this makes it about 172.3 frames per second. Because I want to have 30fps I decided to do some trial and error on the sample rate till I got to 23038. Applying this when loading the file:

y, sr = librosa.load(filename, sr=23038)

I obtain a frame rate of 89.998 fps, which seems more usable for my purposes. So I went on and reduced the frequency resolution of my data by averaging the readings into 9 "buckets":

#Initialize new empty array of target size
nD = np.empty(shape=(9,D.shape[1]))
#populate new array
for i in range(D.shape[1]):
    nD[:,i] = D[:,i].reshape(-1, 57).mean(axis=1)

The 9 comes from 513 / 57 = 9: I needed a bucket size that is a factor of 513. Next, I aggregated every 3 frames into one to get to my 30 fps:

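# Initialize the output array: one column per group of 3 frames
nnD = np.empty(shape=(nD.shape[0], nD.shape[1] // 3))
count = 0
for i in range(0, nD.shape[1],3):
    try:
        nnD[:,count] = nD[:,i:i+3].reshape(-1, 3).mean(axis=1)
        count+=1
    except Exception:
        pass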

This seems totally hacky: the try/except is there because I sometimes get index errors, and manually incrementing a counter seems clumsy. But, incredibly enough, it seems to work. For testing purposes I made a visualization routine:

from time import sleep
from os import system
nnD = ((nnD+80)/10).astype(int)
for i in range(nnD.shape[1]):
    for j in range(nnD.shape[0]):
        print ("#"*nnD[j,i])
    sleep(1/30)
    system('clear')

That shows rows made of # in the terminal.

Now, my question is:

How can I make this the proper way?

More specifically:

1. Is there a way to match the frame rate of the Fourier data without hacking the sample rate?

2. Is there a more proper way to aggregate my data, ideally into an arbitrary number of bands instead of having to pick a factor of 513?

Hirabayashi Taro
  • What you really want is to tweak the [`stft()` call's `n_fft` and `hop_length`](https://librosa.org/doc/latest/generated/librosa.stft.html#librosa.stft) to adjust the number of "instants" in time the analysis comprises. – AKX Apr 13 '21 at 13:58
  • @AKX I thought about that, but I wasn't able to figure out what a "reverse formula" can be. – Hirabayashi Taro Apr 13 '21 at 14:16
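
For reference, the "reverse formula" mentioned in the comments can be sketched as follows. It relies on one STFT frame being produced every `hop_length` samples, and on librosa's documented default of `hop_length = win_length // 4` (with `win_length` defaulting to `n_fft`). The helper names `frame_rate` and `hop_for_fps` are made up for this illustration and are not part of librosa.

```python
def frame_rate(sr, hop_length):
    """STFT frames per second: one frame every `hop_length` samples."""
    return sr / hop_length


def hop_for_fps(sr, fps):
    """Nearest integer hop length (in samples) for a target frame rate."""
    return max(1, round(sr / fps))


# With n_fft=1024 and hop_length=None, the hop is 1024 // 4 = 256 samples
print(frame_rate(44100, 1024 // 4))  # 172.265625
print(frame_rate(23038, 1024 // 4))  # roughly 90

# Going the other way: pick the hop from the desired frame rate
hop = hop_for_fps(44100, 30)
print(hop, frame_rate(44100, hop))   # 1470 30.0
```

When `sr / fps` is not an integer, the rounded hop only approximates the target rate. Also note that a hop larger than `n_fft` leaves gaps between analysis windows, so `n_fft` probably has to grow along with the hop.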

2 Answers


If you want the time resolution of your audio features to be approximately N FPS, you can do something like the code below. But note that since the hop needs to be an integer number of samples, this will lead to drift; see the printed output. Resynchronizing regularly is thus necessary, for example once per minute. Or maybe one can get away with once per song.

import math

def next_power_of_2(x):
    return 2**(math.ceil(math.log(x, 2)))

def params_for_fps(fps=30, sr=16000):
    frame_seconds=1.0/fps
    frame_hop = round(frame_seconds*sr) # in samples
    frame_fft = next_power_of_2(2*frame_hop)
    rel_error = (frame_hop-(frame_seconds*sr))/frame_hop
    
    return frame_hop, frame_fft, rel_error


seconds = 10*60
fps = 15
sr = 16000
frame_hop, frame_fft, frame_err = params_for_fps(fps=fps, sr=sr)
print(f"Frame timestep error {frame_err*100:.2f} %")
drift = frame_err * seconds
print(f"Drift over {seconds} seconds: {drift:.2f} seconds. {drift*fps:.2f} frames")

# Then variables can be used with
# librosa.stft(...hop_length=frame_hop, n_fft=frame_fft)

If this approach is not good enough, one needs to interpolate the audio features based on the (video) frame counter. Linear interpolation will do fine. That makes it possible to compensate for the drift. It can be done dynamically for each frame, or one can resample the audio time series so that it is aligned with the video frames.
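
A minimal sketch of the resampling variant, using `np.interp`. It assumes the features are a `(n_bands, n_frames)` array and that frame `t` sits at time `t * hop_length / sr`, which is the case for a centered STFT. The function name `resample_features_to_fps` and the synthetic input are hypothetical, for illustration only.

```python
import numpy as np


def resample_features_to_fps(features, sr, hop_length, fps):
    """Linearly interpolate (n_bands, n_frames) features onto video frame times."""
    n_frames = features.shape[1]
    # Time in seconds at which each analysis frame is centered
    feature_times = np.arange(n_frames) * hop_length / sr
    # Time in seconds of each video frame, covering the same duration
    n_video = int(np.floor(feature_times[-1] * fps)) + 1
    video_times = np.arange(n_video) / fps
    # np.interp works on 1-D data, so interpolate each band separately
    return np.stack(
        [np.interp(video_times, feature_times, band) for band in features]
    )


# Synthetic stand-in for an STFT magnitude matrix: 9 bands, 100 frames
features = np.random.default_rng(0).random((9, 100))
aligned = resample_features_to_fps(features, sr=16000, hop_length=1067, fps=15)
print(aligned.shape)
```

Because the video frame times are computed from the frame counter rather than accumulated hop by hop, the rounding error of the integer hop does not build up over time.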

Jon Nordby

This does the trick nicely enough. As I suspected, you need to tweak hop_length.

import time

import librosa
import numpy as np
import scipy.signal

# Tunable parameters
hop_length_secs = 1 / 30
bands = 10  # How many frequency bands?
characters = " ..::##@"  # Characters to print things with

filename = librosa.ex("trumpet")
y, sr = librosa.load(filename, sr=22050)
sound_length = y.shape[0] / sr
print(f"{sound_length = }")
hop_length_samples = int(hop_length_secs * sr)
print(f"{hop_length_secs = }")
print(f"{hop_length_samples = }")

stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=hop_length_samples))
num_bins, num_samples = stft.shape

# This should be approximately `sound_length` now
print(f"{num_samples * hop_length_secs = }")

# Resample to the desired number of frequency bins
stft2 = np.abs(scipy.signal.resample(stft, bands, axis=0))
stft2 = stft2 / np.max(stft2)  # Normalize to 0..1


# Remap the 0..1 signal to integer indices
# -- the square root boosts the otherwise lower signals for better visibility.
char_indices = (np.sqrt(stft2) * (len(characters) - 1)).astype(np.uint8)

# Print out the signal "in time".
for frame in range(num_samples):  # `frame` avoids shadowing the audio array `y`
    print("".join(characters[i] for i in char_indices[:, frame]))
    time.sleep(hop_length_secs)

The output is

sound_length = 5.333378684807256
hop_length_secs = 0.03333333333333333
hop_length_samples = 735
num_samples * hop_length_secs = 5.366666666666666

.:..
.##:.. .
.#:: . .
.#:. .
.#:.
.#:. . .
....
.@#:.. ...
.##:.. .
.#:. .
.#:.
.::.
.#:.
.::.
.::.
.::.
....
.::.
.##:.. .
.##:.. .
.#:. .
.#:.
.::.
.:..
.:..
.:..
.:..
.##:..
.##:.. .
.##:.. .
.##:..
.:..

(etc...)

which, if you squint, does look like a visualization...
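
As an alternative to `scipy.signal.resample`, the bin averaging from the question can be generalized to any number of bands with `np.array_split`, which tolerates group sizes that do not divide 513 evenly. This is only a sketch on synthetic data; `average_into_bands` is a made-up helper name, and the bands are linearly spaced in frequency, which may not be the most pleasing choice visually.

```python
import numpy as np


def average_into_bands(spec, n_bands):
    """Average the rows (frequency bins) of `spec` into `n_bands` groups."""
    # array_split allows uneven groups, e.g. 513 bins into 10 bands
    groups = np.array_split(spec, n_bands, axis=0)
    return np.stack([group.mean(axis=0) for group in groups])


# Synthetic stand-in for an STFT magnitude matrix: 513 bins, 160 frames
spec = np.random.default_rng(0).random((513, 160))
bands = average_into_bands(spec, 10)
print(bands.shape)  # (10, 160)
```

The result can be normalized and mapped to characters the same way as `stft2` above. Since averaging keeps non-negative magnitudes non-negative, the extra `np.abs` should not be needed in that case.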

AKX