
Folks,

I am wondering if someone can explain to me what exactly the output of video decoding is. Let's say it is an H.264 stream in an MP4 container.

For displaying something on the screen, I guess the decoder can provide two different types of output:

  1. Point - the (x, y) coordinates of the pixel and its (R, G, B) color
  2. Rectangle - the (x, y, w, h) of the rectangle and the (R, G, B) color to fill it with

There is also the issue of timestamps.

Can you please enlighten me or point me to the right link on what is generated by a decoder and how a video client can use this information to display something on screen?

I intend to download the VideoLAN source and examine it, but some explanation would be helpful.

Thank you in advance for your help.

Regards, Peter


2 Answers


None of the above.

Usually the output will be a stream of bytes that contains just the color data. The X,Y location is implied by the dimensions of the video.

In other words, the first three bytes might encode the color value at (0, 0), the next three bytes the value at (0, 1), and so on. Some formats might use four-byte groups, or even a number of bits that doesn't add up to a whole byte -- for example, if you use 5 bits for each color component and you have three color components, that's 15 bits per pixel. This might be padded to 16 bits (exactly two bytes) for efficiency, since that aligns the data in a way that CPUs can process it better.

When you've processed exactly as many values as the video is wide, you've reached the end of that row. When you've processed exactly as many rows as the video is high, you've reached the end of that frame.
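As a minimal sketch of that layout (assuming a tightly packed 24-bit RGB buffer with no row padding; real decoder output usually has a per-row stride that can be larger than width * 3):

```c
#include <stdint.h>
#include <stddef.h>

/* Read the color of pixel (x, y) from a tightly packed 24-bit RGB frame.
 * Assumes 3 bytes per pixel and no row padding; real APIs usually expose
 * a stride (bytes per row) that may be larger than width * 3. */
static void get_pixel_rgb(const uint8_t *frame, int width, int x, int y,
                          uint8_t *r, uint8_t *g, uint8_t *b)
{
    size_t offset = ((size_t)y * width + x) * 3;  /* row-major layout */
    *r = frame[offset + 0];
    *g = frame[offset + 1];
    *b = frame[offset + 2];
}
```

With a stride, the offset would be `y * stride + x * 3` instead.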

As for the interpretation of those bytes, that depends on the color space used by the codec. Common color spaces are YUV, RGB, and HSL/HSV.

Which color space you get depends strongly on the codec in use and what input format(s) it supports; the output format is usually restricted to the set of formats that are acceptable as input.
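As an illustration of the color-space point, here is a rough sketch of converting one YUV sample to RGB using approximate full-range BT.601 coefficients; real players normally hand this off to a library or to the GPU, and the exact matrix and range depend on the stream:

```c
#include <stdint.h>

static uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Convert one YUV pixel to RGB with approximate full-range BT.601
 * coefficients. The exact coefficients depend on the video's color matrix. */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t *r, uint8_t *g, uint8_t *b)
{
    int d = u - 128;   /* U and V are centered around 128 */
    int e = v - 128;

    *r = clamp8(y + (int)(1.402 * e));
    *g = clamp8(y - (int)(0.344 * d + 0.714 * e));
    *b = clamp8(y + (int)(1.772 * d));
}
```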

Timestamp data is a bit more complex, since that can be encoded in the video stream itself, or in the container. At a minimum, the stream would need a framerate; from that, the time of each frame can be determined by counting how many frames have been decoded already. Another approach, the one taken by AVI, is to include a byte offset for every Nth frame (or just the keyframes) at the end of the file to enable rapid seeking. (Otherwise, you would need to decode every frame up to the timestamp you're looking for in order to determine where in the file that frame is.)
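A sketch of the frame-counting approach, assuming a constant frame rate stored as a rational number (the function name is just for illustration):

```c
/* Derive a presentation time from a frame index and a constant frame rate.
 * Frame rates are often stored as a rational (e.g. 30000/1001 for NTSC),
 * which is why the numerator and denominator are kept separate. */
static double frame_time_seconds(long frame_index, long fps_num, long fps_den)
{
    /* time = index / fps = index * den / num */
    return (double)frame_index * fps_den / fps_num;
}
```

For example, frame 300 at 30000/1001 fps works out to about 10.01 seconds.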

And if you're considering audio data too, note that with most codecs and containers, the audio and video streams are independent and know nothing about each other. During encoding, the software that writes both streams into the container format performs a process called muxing. It will write out the data in chunks of N seconds each, alternating between streams. This allows whoever is reading the stream to get N seconds of video, then N seconds of audio, then another N seconds of video, and so on. (More than one audio stream might be included too -- this technique is frequently used to mux video plus English and Spanish audio tracks into a single file that contains three streams.) In fact, even subtitles can be muxed in with the other streams.
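A rough sketch of what reading those interleaved chunks looks like from the player's side; the `Packet` type, stream indices, and decoder hooks here are hypothetical placeholders, not any real container API:

```c
#include <stddef.h>

/* Hypothetical stream indices and decoder hooks -- placeholders for
 * whatever a real demuxer and the player's decoders would provide. */
enum { VIDEO_STREAM = 0, AUDIO_STREAM_EN = 1, AUDIO_STREAM_ES = 2 };

typedef struct {
    int         stream_index;   /* which muxed stream this chunk belongs to */
    const void *data;
    size_t      size;
} Packet;

int  read_next_packet(Packet *pkt);            /* returns 0 at end of file */
void decode_video(const Packet *pkt);
void decode_audio(const Packet *pkt, int track);

/* The demux loop: the container hands back packets in the interleaved
 * order they were written, and the player routes each one to the decoder
 * for its stream. */
void demux_loop(void)
{
    Packet pkt;
    while (read_next_packet(&pkt)) {
        switch (pkt.stream_index) {
        case VIDEO_STREAM:    decode_video(&pkt);    break;
        case AUDIO_STREAM_EN: decode_audio(&pkt, 0); break;
        case AUDIO_STREAM_ES: decode_audio(&pkt, 1); break;
        default:              /* subtitles, etc. */  break;
        }
    }
}
```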

cdhowie
  • @cdhowie Thank you very much for your explanation. I have a subsequent question. From what you have described, the video client has to draw each frame independently. Won't that be too CPU/GPU-intensive, given that the changes between consecutive frames are very small? Is it left to the video client to compare the previous frame with the next frame, identify the pixels that need to be redrawn, and just draw that portion on the screen? – Peter Aug 18 '11 at 05:22
  • @Peter It seems like a lot of CPU, but it's not. The frames are usually drawn in one operation by pushing the frame buffer to the video card with the help of the video card driver. There are also video cards that support hardware video decoding, so the software application would actually ship the *compressed* video stream to the GPU and it would decode it on-chip and render it directly to the display with little to no CPU involvement. Even without these optimizations, modern CPUs are very fast and could handle tasks like this easily. – cdhowie Jan 26 '15 at 16:17
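A rough sketch of the "push the whole frame buffer in one operation" idea, using SDL2 purely as an example; the renderer, texture, and plane pointers are assumed to come from the player's setup and decoder code:

```c
#include <SDL2/SDL.h>

/* Upload one decoded YUV 4:2:0 frame to the GPU and display it.
 * "renderer" and "texture" (created elsewhere with SDL_PIXELFORMAT_IYUV and
 * SDL_TEXTUREACCESS_STREAMING) and the plane pointers/pitches are assumed
 * to come from the player's setup code and its decoder. */
void present_frame(SDL_Renderer *renderer, SDL_Texture *texture,
                   const Uint8 *y, int y_pitch,
                   const Uint8 *u, int u_pitch,
                   const Uint8 *v, int v_pitch)
{
    /* One bulk upload of the whole frame -- not per-pixel drawing. */
    SDL_UpdateYUVTexture(texture, NULL, y, y_pitch, u, u_pitch, v, v_pitch);

    SDL_RenderClear(renderer);
    SDL_RenderCopy(renderer, texture, NULL, NULL);  /* scale to the window */
    SDL_RenderPresent(renderer);
}
```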

cdhowie got most of it. When it comes to timestamps, the MPEG-4 container contains tables that tell the video client when to display each frame. You should look at the spec for MPEG-4 Part 14; you normally have to pay for the official spec, I think, but it can be found online.

http://en.wikipedia.org/wiki/MPEG-4_Part_14
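For a feel of how those tables work: MP4 stores per-sample durations in the stts ("time-to-sample") box as run-length-encoded (count, delta) pairs in the track's timescale. A simplified sketch (ignoring edit lists and composition-time offsets) of turning a sample index into a timestamp:

```c
#include <stdint.h>
#include <stddef.h>

/* One run-length entry from an MP4 stts ("time-to-sample") box:
 * sample_count consecutive samples, each lasting sample_delta timescale units. */
typedef struct {
    uint32_t sample_count;
    uint32_t sample_delta;
} SttsEntry;

/* Sum the durations of all samples before sample_index and convert to
 * seconds using the track's timescale (ticks per second). */
double sample_time_seconds(const SttsEntry *entries, size_t entry_count,
                           uint64_t sample_index, uint32_t timescale)
{
    uint64_t ticks = 0;
    for (size_t i = 0; i < entry_count && sample_index > 0; i++) {
        uint64_t n = entries[i].sample_count;
        if (n > sample_index)
            n = sample_index;
        ticks += n * entries[i].sample_delta;
        sample_index -= n;
    }
    return (double)ticks / timescale;
}
```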

James
  • Adobe's F4V video file format is a superset of MPEG4 and the specification can be downloaded (for free) from http://download.macromedia.com/f4v/video_file_format_spec_v10_1.pdf – Perry Aug 17 '11 at 23:15
  • @James Appreciate your help. A subsequent question: I see there are two ways for the video client to do the processing. 1) Always look at the timetable, "seek" to the location for the current time, and process the decoder output. 2) Just keep getting the next frame and continue processing, but look up the timetable only when it realizes it is falling behind in time. What is the general approach used by video clients? – Peter Aug 18 '11 at 05:30