Most video containers hold the audio and video streams plus a metadata block that describes things like the duration, the colorspace, the codecs used and the offset of each frame (useful when seeking). In a typical MP4 encoded for the web, this block (the MOOV atom) is written at the end of the file by default, because the frame offsets aren't known until encoding finishes, unless a second pass has been performed to move it to the front, e.g.:
ffmpeg -i source.mp4 -c:a copy -c:v copy -movflags faststart destination.mp4
(copies the audio and video unchanged, just moves the metadata to the start to enable faster access)
You might have noticed that with some MP4 web video you can seek almost immediately, while with others you can't seek accurately until the file has fully loaded. This is because the browser has to make 'guesses' about frame positions until it receives that metadata.
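To check which case a given file falls into, one option is to walk the top-level boxes (atoms) and see whether `moov` appears before `mdat`. The following is a minimal sketch rather than a full MP4 parser; the function names `top_level_boxes` and `moov_before_mdat` are made up for illustration, and it only reads box headers (4-byte big-endian size, 4-byte type, with the 64-bit size extension when size is 1).

```python
import io
import struct


def top_level_boxes(stream):
    """Yield (box_type, offset, size) for each top-level box in an MP4 stream.

    size is None when the box header says "extends to end of file" (size == 0).
    """
    offset = 0
    while True:
        stream.seek(offset)
        header = stream.read(8)
        if len(header) < 8:
            return  # end of data
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # A size of 1 means the real size is in the next 8 bytes.
            ext = stream.read(8)
            if len(ext) < 8:
                return
            size = struct.unpack(">Q", ext)[0]
            header_len = 16
        elif size == 0:
            # A size of 0 means this box runs to the end of the file.
            yield box_type.decode("latin-1"), offset, None
            return
        if size < header_len:
            return  # malformed header, stop rather than loop forever
        yield box_type.decode("latin-1"), offset, size
        offset += size


def moov_before_mdat(stream):
    """Return True if the moov box comes before the mdat box."""
    order = [t for t, _, _ in top_level_boxes(stream) if t in ("moov", "mdat")]
    return order[:1] == ["moov"]
```

Usage would be something like `moov_before_mdat(open("destination.mp4", "rb"))`; a file processed with `-movflags faststart` should report `True`.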
For MP3 files specifically, you could use an HTTP Range request to ask the server for just the ID3v1 tag and the enhanced tag data (the last 128 bytes of the file and the 227 bytes before them) without having to download the whole file.
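As a rough sketch of that idea in Python: the code below sends a suffix `Range` header for the last 355 bytes (128 + 227) and parses the fixed-width ID3v1 fields. The helper names (`fetch_tail`, `parse_id3v1`, `has_enhanced_tag`) are hypothetical, and it assumes the server honours Range requests (responds with 206) and that the file actually carries an ID3v1 tag; ID3v2 tags, which sit at the start of the file, are not handled here.

```python
import urllib.request

ID3V1_SIZE = 128      # "TAG" block at the very end of the file
ENHANCED_SIZE = 227   # optional "TAG+" block immediately before it


def fetch_tail(url, length=ID3V1_SIZE + ENHANCED_SIZE):
    """Request only the last `length` bytes of the resource at `url`."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{length}"})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:
            # A 200 here means the server ignored the Range header.
            raise RuntimeError("server did not return partial content")
        return resp.read()


def _text(raw):
    """Decode a fixed-width, NUL-padded field."""
    return raw.split(b"\x00", 1)[0].decode("latin-1").strip()


def parse_id3v1(tail):
    """Parse the ID3v1 tag from the final 128 bytes, or return None."""
    tag = tail[-ID3V1_SIZE:]
    if len(tag) < ID3V1_SIZE or tag[:3] != b"TAG":
        return None
    return {
        "title": _text(tag[3:33]),
        "artist": _text(tag[33:63]),
        "album": _text(tag[63:93]),
        "year": _text(tag[93:97]),
        "comment": _text(tag[97:127]),
        "genre": tag[127],  # numeric genre index
    }


def has_enhanced_tag(tail):
    """Check whether a 227-byte "TAG+" block precedes the ID3v1 tag."""
    block = tail[-(ID3V1_SIZE + ENHANCED_SIZE):-ID3V1_SIZE]
    return len(block) == ENHANCED_SIZE and block[:4] == b"TAG+"
```

Typical use would be `parse_id3v1(fetch_tail(url))`, where `url` points at an MP3 on a server that supports byte ranges.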