
I am currently attempting to create a modem-like script in Python that uses sounddevice to communicate by sound with other instances of itself, much like a real modem did in the old days.

I have already developed some transmit and reply functions, like a DTMF generator and a binary converter, but I have a problem with detecting certain frequencies (440 Hz + 350 Hz, i.e. the dial tone), which prevents me from moving on to listening for other sounds (DTMF, data, etc.) and replying in real time.

I am also pretty new to sounddevice and NumPy; I've only used NumPy code provided by other users for OpenCV. So far I have only figured out how to create and play chosen sine waves for a chosen amount of time. For the receiving part I mostly used ChatGPT, but its code either didn't reply or always returned an error, so I've decided to try to write one myself. Unfortunately, the documentation doesn't make sense to me, at least not yet.

If you could help me in any way with the script ChatGPT gave me, here it is:

import sounddevice as sd
import numpy as np

# Parameters
target_frequencies = [440, 350]  # Frequencies to detect (440Hz and 350Hz)
duration = 15  # Duration in seconds
sample_rate = 44100  # Sample rate

# Callback function for audio input
def audio_callback(indata, frames, time, status):
    # Convert audio data to mono
    mono_data = np.mean(indata, axis=1)
    
    # Compute the Fast Fourier Transform (FFT)
    fft_data = np.fft.fft(mono_data)
    freqs = np.fft.fftfreq(len(fft_data), 1 / sample_rate)
    
    # Find the indices of the target frequencies
    target_indices = [np.argmin(np.abs(freqs - freq)) for freq in target_frequencies]
    
    # Check if the target frequencies are present
    if all(abs(fft_data[index]) > 10000 for index in target_indices):
        print("yo yo yo")

# Start recording
with sd.InputStream(callback=audio_callback, channels=2, samplerate=sample_rate):
    print("Listening for tones...")
    sd.sleep(int(duration * 1000))  # Record for the desired duration
    print("Recording finished")

Otherwise, please at least explain to me how, for example, InputStream works and how I can detect sounds from it.

Thank you!

River
  • [All use of generative AI (e.g., ChatGPT and other LLMs) is banned when posting content on Stack Overflow.](https://meta.stackoverflow.com/questions/421831/temporary-policy-generative-ai-e-g-chatgpt-is-banned) – Reinderien Aug 30 '23 at 01:41
  • 1
    As a surprise to no one, the AI script is nonsense. Why configure for two channels and then average to one, instead of just... configuring to one channel? etc. – Reinderien Aug 30 '23 at 01:43
  • `argmin` should almost certainly be `argmax` – Reinderien Aug 30 '23 at 01:48

1 Answer


I hope from what ChatGPT has produced that you've learned not to trust it for programming applications.

For your application it isn't enough to detect the maximum spectral component, and it certainly isn't enough to detect any such component above a blanket, arbitrary value of 10,000. Instead you need some kind of heuristic to compare the in-band spectral energies you care about to the total energy, and if that exceeds a threshold, then your tone is considered present. (In addition, you'll want a check for total energy to distinguish "some sound" from "background noise" for your environment; I have not shown this.)
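As an aside, the unshown total-energy check could be as simple as an RMS gate. A minimal sketch (the function name and the 0.01 threshold are mine, and arbitrary; you'd calibrate the threshold for your own room and microphone):

```python
import numpy as np

def loud_enough(chunk: np.ndarray, rms_threshold: float = 0.01) -> bool:
    """Crude energy gate: reject chunks whose RMS amplitude falls below a
    threshold (the 0.01 here is arbitrary; calibrate for your environment)."""
    rms = np.sqrt(np.mean(chunk ** 2))
    return bool(rms > rms_threshold)

# Silence fails the gate; a full-scale 440 Hz tone passes it.
silence = np.zeros(4410)
tone = np.sin(2 * np.pi * 440 * np.arange(4410) / 44100)
print(loud_enough(silence), loud_enough(tone))  # False True
```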

The FFT has many tradeoffs based on the sample size and the sampling frequency. You don't want the frequency resolution to be too low, or you won't be able to distinguish good frequencies from bad; you don't want it too high, or each chunk will take longer than necessary to capture (and use more memory than it needs, as well). You don't want the sample size to be too low, or you'll miss your lowest frequency; you don't want it too high, or you'll take too long to capture a sample and won't respond as quickly as you could.

A reasonable value for frequency bucket size is 10 Hz in this case, because the greatest common factor of your two frequencies of interest is 10, and that's enough to distinguish those tones from other tones in the DTMF/POTS system.
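As a quick sanity check on that arithmetic (using the 44.1 kHz rate from the question):

```python
rate = 44100           # sample rate from the question
freq_bucket_size = 10  # Hz per FFT bin; greatest common factor of 350 and 440
n = rate // freq_bucket_size  # samples per FFT chunk
print(n, n / rate)            # 4410 0.1 -- i.e. 0.1 s of audio per chunk

# Both dial-tone frequencies land exactly on bin boundaries, well away from
# the Precise Tone Plan's 480/620 Hz and the lowest DTMF row tone at 697 Hz.
print(350 // freq_bucket_size, 440 // freq_bucket_size)  # 35 44
```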

Before trying it on your mic, try it on a canned file from Wikipedia:

import librosa
import numpy as np


print('Loading canned tone...')
canned, rate = librosa.load('US_dial_tone.mp3', mono=True)

# Ignore next-highest DTMF tone of 697 Hz and up
# From the Precise Tone Plan (https://en.wikipedia.org/wiki/Precise_tone_plan),
# ignore 480 and 620 Hz
freq_bucket_size = 10  # greatest common factor of 350 and 440
n = rate//freq_bucket_size
target_frequencies = 350, 440
target_idx = [f//freq_bucket_size for f in target_frequencies]

print('Processing...')
canned = canned[:len(canned) - len(canned)%n]
for chunk in canned.reshape((-1, n)):
    ampl = np.abs(np.fft.rfft(chunk))
    total_energy = ampl[1:].sum()
    tone_energy = ampl[target_idx].sum()
    match = tone_energy / total_energy
    if match > 0.5:
        print(f'Tone matched at {match:.1%} energy')
Loading canned tone...
Processing...
Tone matched at 90.4% energy
Tone matched at 91.3% energy
Tone matched at 89.7% energy
Tone matched at 90.5% energy
Tone matched at 91.1% energy
Tone matched at 91.0% energy
Tone matched at 89.7% energy
Tone matched at 90.8% energy
Tone matched at 92.1% energy
Tone matched at 90.8% energy
Tone matched at 89.7% energy
Tone matched at 90.1% energy
Tone matched at 92.5% energy
Tone matched at 91.7% energy
Tone matched at 89.8% energy
Tone matched at 91.3% energy
Tone matched at 92.1% energy
Tone matched at 90.8% energy
Tone matched at 91.2% energy
Tone matched at 91.3% energy
Tone matched at 92.5% energy
Tone matched at 91.8% energy
Tone matched at 90.3% energy
Tone matched at 93.1% energy
Tone matched at 93.2% energy
Tone matched at 91.3% energy
Tone matched at 91.8% energy
Tone matched at 93.2% energy
Tone matched at 92.6% energy
Tone matched at 92.9% energy
Tone matched at 93.3% energy
Tone matched at 93.8% energy
Tone matched at 94.1% energy
Tone matched at 92.2% energy
Tone matched at 92.9% energy
Tone matched at 93.3% energy
Tone matched at 93.8% energy
Tone matched at 93.6% energy
Tone matched at 93.0% energy
Tone matched at 94.0% energy
Tone matched at 92.1% energy
Tone matched at 91.9% energy
Tone matched at 93.4% energy
Tone matched at 93.2% energy
Tone matched at 91.3% energy
Tone matched at 92.9% energy
Tone matched at 92.8% energy
Tone matched at 91.1% energy
Tone matched at 91.4% energy
Tone matched at 90.1% energy
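To wire the same analysis into the question's live sounddevice capture, the per-chunk test can be factored into a pure-NumPy function and called from the InputStream callback once enough frames have accumulated. A sketch, demonstrated here on synthesized audio rather than a live stream (the function, its names, and the 0.5 threshold are my assumptions, not values from the answer above):

```python
import numpy as np

RATE = 44100
FREQ_BUCKET = 10                 # Hz per FFT bin, as in the answer
N = RATE // FREQ_BUCKET          # 4410 samples per analysis chunk
TARGET_IDX = [350 // FREQ_BUCKET, 440 // FREQ_BUCKET]  # bins 35 and 44

def dial_tone_present(chunk: np.ndarray, threshold: float = 0.5) -> bool:
    """True if the 350 + 440 Hz pair dominates the chunk's spectral energy."""
    ampl = np.abs(np.fft.rfft(chunk))
    total = ampl[1:].sum()       # skip the DC bin
    if total == 0:
        return False
    return bool(ampl[TARGET_IDX].sum() / total > threshold)

# Demo on synthesized audio instead of a live stream.
t = np.arange(N) / RATE
dial = 0.5 * np.sin(2 * np.pi * 350 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).normal(size=N)
print(dial_tone_present(dial), dial_tone_present(noise))  # True False
```

In a sounddevice callback you would append incoming mono samples to a buffer and call `dial_tone_present` on each full block of N frames.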
Reinderien
  • Thank you for your reply, but I mean to detect frequencies directly from system sounds, not through a microphone or a file. – River Aug 30 '23 at 14:35
  • Adjust the front-end accordingly, then. You did not offer a reproducible example that _generates_ testable system sounds, and so I offered the most straightforwardly reproducible code whose analysis will remain the same for any kind of capture. – Reinderien Aug 30 '23 at 14:44