How do I make the voice delay disappear?

Question

I'm trying to build something similar to VoIP: I record voice and send it over the network to another program using UDP. (This is not a question about encryption.) The code runs, but the audio comes out choppy.

In other words, short words come through in full, but in longer phrases I can always hear the moment where playback is interrupted while the receiver waits for the next packet to arrive before continuing.

How do I make my voice sound smooth on the receiving side? I tried using threading to speed up the recording, but it didn't make much difference, and I don't know what else to try.

The Server Side:

import sounddevice as sd
import socket, pickle

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

h = socket.gethostbyname(socket.gethostname())

s.bind((h,9001))

print("Server running on "+str(h)+":9001")

while True:
    r = pickle.loads(s.recvfrom(102400)[0])
    sd.play(r,4410)

The Client Side:

import sounddevice as sd
import socket, pickle, threading

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

ip = input("IP >> ")

data = None

def Enviar():
    global data
    s.sendto(pickle.dumps(data),(ip,9001))

while True:
    data = sd.rec(4410, samplerate=4410, channels=2)
    sd.wait()
    threading.Thread(target=Enviar, args=()).start()

You need to identify what is causing the packets to be dropped. Since you are using UDP, it's normal for *some* packets to be dropped, but not very many. — user253751, Jun 30 '21 at 14:25
Look, I was running both programs on the same Wi-Fi network, and I could hear that each chunk of audio played was a continuation of the previous one. I just want to know how to make the transition between played chunks smoother. — Arthur Sally, Jun 30 '21 at 14:32
Network delays are unavoidable and unpredictable. The only way around it is to intentionally delay your playback so it covers any gaps in the network. — Mark Ransom, Jun 30 '21 at 15:14
@MarkRansom Dropped packets on a not-very-busy LAN (even Wi-Fi) should still be fairly infrequent, not frequent enough to make it "choppy". Something else must be going wrong here. — user253751, Jun 30 '21 at 15:25
The first thing I would do is change the LAN from Wi-Fi to wired Ethernet and get things working decently that way. Wi-Fi is notoriously inconsistent (both in terms of packet-delivery-success and packet-delivery-timing) and it will be much more difficult to find and fix any software problems if your network's performance-characteristics are surreptitiously changing on you from one moment to the next because your downstairs neighbor turned on the vacuum cleaner or etc. — Jeremy Friesner, Jun 30 '21 at 15:29
@user253751 that's why I said "delays" and not "drops". Although TCP would be better than UDP if you want to trade drops for delays. — Mark Ransom, Jun 30 '21 at 15:30
@JeremyFriesner the surest way to kill the wifi in our house is to turn on the microwave. I don't know what happens if the neighbor turns on theirs. — Mark Ransom, Jun 30 '21 at 15:32
@MarkRansom The code in question will let packets queue up (in the OS); any delay should be compensated because the queue will get longer after each gap not covered by the queue, until the queue covers all the gaps. The queue will get shorter after a dropped packet. — user253751, Jun 30 '21 at 15:43
@user253751 how can packets queue up when they're being consumed at the same rate they're being produced? — Mark Ransom, Jul 01 '21 at 00:54
@MarkRansom If packets are consumed at the rate they're produced, the queue *remains the same size*... if there's an underrun (the queue is empty) it can't get any shorter. But it can get longer, when the next two packets are received at once. — user253751, Jul 01 '21 at 17:14

Jeremy Friesner · Accepted Answer · 2021-06-30T15:31:33.533

With computer audio, the receiving computer's sound card has a sample clock that determines how fast it converts audio sample values into electrical signals that drive the speaker. The sample clock runs at a fixed rate (e.g. 48000 samples per second, or whatever you've set it to) and in order for the audio to sound correct, a new audio sample must be fed into the sound card every 1/48000th of a second.

In order to reduce the CPU load on the host computer, the sound card usually has a built-in audio buffer, so that instead of forcing the CPU to wake up every 1/48000th of a second to send exactly one sample, you can instead have the CPU wake up e.g. once every 100 ms and write in 4800 samples all at once. The sound card's internal electronics then manage feeding the individual samples out of that buffer.
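The arithmetic behind those numbers can be checked with a short sketch. The function names here are made up for illustration; the last line uses the values from the question's client code (4410 samples recorded at a sample rate of 4410), which works out to one full second of audio per packet.

```python
def samples_per_wakeup(sample_rate_hz, wakeup_interval_ms):
    """Samples the CPU must write per wake-up to keep pace with the sample clock."""
    return sample_rate_hz * wakeup_interval_ms // 1000


def buffer_duration_ms(num_samples, sample_rate_hz):
    """How long a buffer holding num_samples lasts before it drains."""
    return 1000.0 * num_samples / sample_rate_hz


print(samples_per_wakeup(48000, 100))   # 4800
print(buffer_duration_ms(4800, 48000))  # 100.0
print(buffer_duration_ms(4410, 4410))   # 1000.0 (the question's chunk size)
```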

Therefore, the secret to continuous sound is never to let the sound card's buffer become empty. When the buffer is drained to empty (and therefore the sound card can't get the next sample to play at the instant it needs to play it) that is known as an audio underrun and it causes a glitch in the audio, as you heard.

The easiest way to prevent the underruns is to buffer up more audio on the receiving computer, so that more time can pass without data being received before an underrun occurs. Of course, the downside of this is that there will be more latency between the time the sender sends the data and the time receiver plays it; that's probably okay for e.g. streaming recorded music, but not so good for a live voice conversation.
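As a rough illustration of that buffering idea, here is a minimal sketch of a receive-side queue that holds playback back until a few chunks have arrived. The class name `JitterBuffer` and its `prefill` parameter are hypothetical, chunks are plain lists of floats to keep the example self-contained, and it is not wired up to sounddevice or to a socket.

```python
from collections import deque


class JitterBuffer:
    """Hold back playback until `prefill` chunks are queued (hypothetical helper)."""

    def __init__(self, prefill, chunk_len):
        self.prefill = prefill
        self.silence = [0.0] * chunk_len
        self.queue = deque()
        self.primed = False
        self.underruns = 0

    def push(self, chunk):
        """Call whenever a chunk arrives from the network."""
        self.queue.append(chunk)
        if len(self.queue) >= self.prefill:
            self.primed = True

    def pop(self):
        """Call once per playback period; always returns a chunk to play."""
        if self.primed and self.queue:
            return self.queue.popleft()
        if self.primed:
            # The queue ran dry: count the underrun and refill before resuming.
            self.underruns += 1
            self.primed = False
        return self.silence


buf = JitterBuffer(prefill=2, chunk_len=4)
buf.push([0.1] * 4)
print(buf.pop())  # [0.0, 0.0, 0.0, 0.0] because the buffer is still pre-filling
```

A larger `prefill` tolerates longer gaps between packets at the cost of proportionally more latency, which is the trade-off described above.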

The harder approach is to ensure that all data makes it across the network in a short amount of time; to do this with guaranteed reliability you need a special networking switch that allows devices to pre-reserve bandwidth so that they can guarantee that their audio packets won't get dropped. Without this guarantee, you are left just hoping for the best; on a wired Ethernet connection you can often get away with it for a small number of audio channels, but over WiFi, as you've seen, the network is often very unreliable and so you will probably hear underrun-glitches in many situations, unless you dial up the buffering quite a lot.

Some protocols use Forward Error Correction math to encode the audio in such a way that even if some subset of the UDP packets are lost, the original audio sample values can still be reconstructed from the remaining packets that were received. That increases the overall bandwidth usage somewhat, but it allows audio to avoid glitching as long as the number of dropped packets is relatively small. I'm not very familiar with how they work, however, so I can't say more about that.
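For a feel of the general idea only, one of the simplest schemes of this kind is to send one extra XOR parity packet per group of equal-length packets, which lets the receiver rebuild any single lost packet in that group. The sketch below is an illustration under that assumption, with made-up function names; it is not taken from any particular voice protocol, and real FEC schemes are more sophisticated.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """XOR a group of equal-length packets into one parity packet."""
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = xor_bytes(parity, p)
    return parity


def recover_missing(received, parity):
    """Rebuild the single missing packet (marked None) in a group, if possible."""
    missing = [i for i, p in enumerate(received) if p is None]
    out = list(received)
    if len(missing) != 1:
        return out  # nothing was lost, or too many were lost to recover
    rebuilt = parity
    for p in received:
        if p is not None:
            rebuilt = xor_bytes(rebuilt, p)
    out[missing[0]] = rebuilt
    return out


group = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(group)
print(recover_missing([b"abcd", None, b"ijkl"], parity))  # [b'abcd', b'efgh', b'ijkl']
```

With a group of three packets this costs one extra packet in every four sent, which is the bandwidth overhead mentioned above.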

The final approach (which I think is what you are asking about) is to have the receiving computer somehow try to "paper over" the missing audio by making up its own replacement sample-values for the missing audio. There are voice protocols that try to do this, with varying degrees of success (you've probably heard the results when talking over a bad cell-phone connection), but IMHO it's not really worth implementing, because there will still be an obvious glitch in the audio; just a different-sounding glitch. It might be worthwhile to fade the last samples of the received audio out to zero if you don't have more samples to follow them (to at least avoid an abrupt "pop") and then after new (post-underrun) audio is received, fade the first samples of the newly-received audio in as well (to avoid a second "pop"), but that only makes the glitch less annoying; it doesn't get rid of it.
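The fade-out and fade-in idea could look roughly like the sketch below. It uses a linear ramp on a mono chunk stored as a plain list of floats so that it runs without extra libraries; the function names and the ramp length `n` are assumptions for illustration, and with sounddevice the same gains would be applied to each channel of the NumPy array instead.

```python
def fade_out(samples, n):
    """Return a copy whose last n samples ramp linearly down to zero."""
    out = list(samples)
    n = min(n, len(out))
    for i in range(n):
        # Gain falls from (n - 1) / n on the first faded sample to 0 on the last.
        out[len(out) - n + i] *= (n - 1 - i) / n
    return out


def fade_in(samples, n):
    """Return a copy whose first n samples ramp linearly up from zero."""
    out = list(samples)
    n = min(n, len(out))
    for i in range(n):
        # Gain rises from 0 on the first sample to (n - 1) / n on the last faded one.
        out[i] *= i / n
    return out


print(fade_out([1.0] * 4, 4))  # [0.75, 0.5, 0.25, 0.0]
print(fade_in([1.0] * 4, 4))   # [0.0, 0.25, 0.5, 0.75]
```

The receiver would apply `fade_out` to the last chunk it has when no further data is queued, and `fade_in` to the first chunk that arrives after the gap.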
