
I have tried a couple of different approaches to generating a grayscale PNG of a DTFT (a spectrogram) of the audio in a video file, but my results don't look anything like what other people are posting. Instead of plotting to the screen with matplotlib (as all the examples tend to do), I am trying to create a PNG with scikit-image (which I trust after using it in other projects).

This is my code, which requires an mp4 file (here I'm using Aphex Twin's song with the notorious Demon Face hidden in the DTFT). I'm specifically interested in getting this working with the av library and I am sufficiently convinced that it is reading the file correctly and producing a numpy array of floating point numbers.

import numpy as np
import av
import skimage.io
import scipy.signal as signal

container = av.open("tmp/aphex.mp4")
frames = container.streams.audio[0].frames
chunk = container.streams.audio[0].frame_size // 2  # bug in av?
rate = container.streams.audio[0].rate

fltp = np.zeros((frames, chunk), dtype=float)
for n, frame in enumerate(container.decode(audio=0)):
    fltp[n, :] = np.frombuffer(frame.planes[0], dtype=float)    
fltp = fltp.flatten()    
# check that it worked by playing it (just one channel)
fltp.tofile("tmp/aphex.raw")
# play -t raw -r 44100 -e floating-point -b 32 -c 1 tmp/aphex.raw

custom_chunk = 4096
freqs, times, data = signal.spectrogram(fltp,
                                        fs=rate,
                                        nperseg=custom_chunk,
                                        detrend="linear")

data = data - np.min(data)
data = data / np.max(data)
data = 1 - data    
skimage.io.imsave("tmp/aphex.png", data)

But this produces a very sparse image (this is an image, honest, not a big vertical space), and if I add the following lines

data = np.log10(data)
data[data == -np.inf] = 0

to introduce a log scale (as many do) then it looks even weirder (cropped because of the 2MB upload limit).

I've tried lots of other things, like trying to normalise per column (which looks slightly better, but still weird) but my spectrograms still look nothing like they are supposed to.

Does anybody know what I'm doing wrong?

(I also tried using numpy.fft.rfft / scipy directly on each chunk... my images looked pretty much the same as these ones. I've also tried a few different movies / songs)

fommil
  • You almost certainly want to detrend each chunk of data when computing the spectrogram, otherwise the DC component will be enormous. – ali_m Jun 13 '17 at 19:40
  • Please take a look at https://github.com/elegant-scipy/elegant-scipy/blob/master/markdown/ch4.markdown – Stefan van der Walt Jun 13 '17 at 21:27
  • In the development version of scipy, the [docstring for `scipy.signal.chirp`](http://scipy.github.io/devdocs/generated/scipy.signal.chirp.html) contains some examples of displaying chirp signals using [`scipy.signal.spectrogram`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html). Maybe you can get something like that working for the audio channel of the mp4 file. – Warren Weckesser Jun 14 '17 at 00:04
  • When you play `aphex.raw` using the `play` command, does it sound correct? When you create a numpy array with `dtype=float`, the data type is 64 bit floating point (also known as double precision), so it is surprising that you use `-b 32` in the `play` command. – Warren Weckesser Jun 14 '17 at 00:11
  • Thanks @ali_m, I'll try that... it sounds sensible. Stefan/Warren, I read those docs but they don't add anything beyond what I have already done and understand (in fact the chirp examples do exactly what I said I don't want to do: plot to the screen). Did I miss some subtle paragraph? – fommil Jun 14 '17 at 08:40
  • @ali_m I swapped to https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.signal.spectrogram.html and tried constant/linear detrend, but it hasn't made a difference. Question updated to reflect current status. – fommil Jun 14 '17 at 09:18
  • @WarrenWeckesser yes it sounds correct. I'm also confused why it needs to be 32 and not 64. When I plot it as a normal plot I can see some huge volume spikes so I wonder if the act of reading this way and then writing is masking a formatting error that is picked up in the python code. I'll try experimenting with WAV input. – fommil Jun 14 '17 at 10:17
  • @WarrenWeckesser ok, looks like that was the hint I needed! Turns out the data is `float32` not `float` which also explains that `// 2` – fommil Jun 14 '17 at 10:21
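
The dtype mismatch identified in the last comment can be reproduced without any audio file. This is a minimal sketch, not the poster's code: the 1024-sample sine wave is a hypothetical stand-in for one decoded frame, and it assumes the plane holds 32-bit floats, as the comment concludes. Reading such a buffer with `dtype=float` (which numpy treats as 64-bit) yields half as many values, which would account for the `// 2` workaround in the question.

```python
import numpy as np

# Hypothetical stand-in for one decoded frame: 1024 float32 samples of a sine wave.
samples = np.sin(2 * np.pi * 440 * np.arange(1024) / 44100).astype(np.float32)
raw = samples.tobytes()  # raw bytes of a buffer of 32-bit floats

# dtype=float means float64, so each pair of 4-byte samples is fused into one
# meaningless 8-byte value and the array comes out half as long.
wrong = np.frombuffer(raw, dtype=float)

# Reading with the matching 32-bit dtype recovers the original samples.
right = np.frombuffer(raw, dtype=np.float32)

print(wrong.size, right.size)
print(np.array_equal(right, samples))
```

In the question's loop this would presumably mean using `dtype=np.float32` in the `np.frombuffer` call and dropping the `// 2` on the frame size.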

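
For the log-scale step in the question, one possible approach is to convert the power to decibels before normalising, rather than taking `log10` of data already rescaled to 0..1. The sketch below is an illustration under stated assumptions: the chirp is a synthetic test signal standing in for the decoded audio, and the `1e-12` floor and 80 dB dynamic range are arbitrary choices, not values from the question.

```python
import numpy as np
import scipy.signal as signal

rate = 44100
t = np.arange(2 * rate) / rate
# Synthetic test signal (an assumption, standing in for the decoded audio).
audio = signal.chirp(t, f0=200, f1=8000, t1=t[-1]).astype(np.float32)

freqs, times, power = signal.spectrogram(audio, fs=rate, nperseg=4096)

# Convert power to decibels; the small floor avoids log10(0) = -inf.
db = 10 * np.log10(np.maximum(power, 1e-12))

# Keep a fixed dynamic range below the peak (80 dB is an arbitrary choice).
db = np.clip(db, db.max() - 80, db.max())

# Rescale to 0..255 and invert so that loud bins are dark.
img = (255 * (1 - (db - db.min()) / (db.max() - db.min()))).astype(np.uint8)

# Flip vertically so low frequencies end up at the bottom of the image.
img = img[::-1]

print(img.shape, img.dtype)
```

The resulting 8-bit array could then be handed to `skimage.io.imsave` as in the question. Clipping to a fixed range keeps a few very quiet bins from stretching the grayscale and washing out the rest of the image.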