
I am trying to implement a hybrid video coding framework of the kind used in the H.264/MPEG-4 video standard, for which I need to perform 'Intra-frame Prediction' and 'Inter Prediction' (in other words, motion estimation) on a set of 30 frames for video processing in Matlab. I am working with the Mother-Daughter frames.

Please note that this post is very similar to my previously asked question, but this one is solely based on Matlab computation.

Edit: I am trying to implement the framework shown below:

[Image: block diagram of the hybrid video coding framework]

My question is: how do I perform the horizontal coding method, which is one of the nine methods of the intra coding framework? How are the pixels sampled?

[Image: illustration of the horizontal intra prediction mode]

What I find confusing is that intra prediction needs two inputs: the 8x8 blocks of the input frame and the 8x8 blocks of the reconstructed frame. But what happens when coding the very first block of the input frame, since there are no reconstructed pixels yet to perform horizontal coding with?

In the image above the whole system is a closed loop, so where do you start?

(End of edit.)

Question 1: Is the intra-predicted image only for the first image (I-frame) of the sequence, or does it need to be computed for all 30 frames?

I know that there are five intra coding modes: horizontal, vertical, DC, left-up to right-down, and right-up to left-down.

Question 2: How do I actually go about comparing the reconstructed frame and the anchor frame (the original current frame)?

Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?

I know that pixels from the reconstructed block are used for comparison, but is it done one pixel at a time within the search area? If so, wouldn't that be too time-consuming when 30 frames are to be processed?


2 Answers


Continuing on from our previous post, let's answer one question at a time.


Question #1

Usually, you use one I-frame and denote this as the reference frame. Then, for each 8 x 8 block in your reference frame, you take a look at the next frame and figure out where this 8 x 8 block best moved to in this next frame. You describe this displacement as a motion vector, and you construct a P-frame that consists of this information. This tells you where each 8 x 8 block from the reference frame best moved in this frame.
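To make that concrete, here is a minimal Matlab sketch of the bookkeeping. `refFrame`, `curFrame` and the `block_search` helper are placeholder names of mine, not anything from the original post (one possible `block_search` is sketched under Question #3 below):

```matlab
% Minimal sketch: build a motion-vector field for one P-frame.
% Assumes grayscale uint8 frames whose dimensions divide evenly by 8,
% and a hypothetical block_search helper (see Question #3).
bs = 8;                                   % block size
[rows, cols] = size(refFrame);
mvs = zeros(rows/bs, cols/bs, 2);         % one (dy, dx) pair per block
for br = 1 : rows/bs
    for bc = 1 : cols/bs
        r = (br-1)*bs + 1;                % top-left pixel of this block
        c = (bc-1)*bs + 1;
        blk = refFrame(r:r+bs-1, c:c+bs-1);
        mvs(br, bc, :) = block_search(blk, curFrame, r, c);
    end
end
```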

Now, the next question you may be asking is how many frames it is going to take before we decide to use another reference frame. This is entirely up to you, and you set this up in your encoder settings. For digital broadcast and DVD storage, it is recommended that you generate an I-frame every 0.5 seconds or so. Assuming 24 frames per second, this means that you would need to generate an I-frame every 12 frames. This Wikipedia article was where I got this reference.

As for the intra-coding modes, these tell the encoder in which direction to look when forming the prediction for a block. Actually, take a look at this paper that talks about the different prediction modes. Take a look at Figure 1; it provides a very nice summary of the various prediction modes. In fact, there are nine altogether. Also take a look at this Wikipedia article to get better pictorial representations of the different mechanisms of prediction. To get the best accuracy, they also do subpixel motion estimation at 1/4-pixel accuracy by doing bilinear interpolation in between the pixels.
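To tie this back to the horizontal mode that the question asks about: in that mode, each pixel of the predicted block is simply a copy of the already-reconstructed pixel immediately to the block's left. A minimal Matlab sketch, with `block` and `leftCol` as assumed placeholder names; as far as I recall the standard, a block with no left neighbour (such as the very first block) falls back to DC prediction with a fixed mid-grey value of 128:

```matlab
% Horizontal intra prediction for one 8x8 block (illustrative sketch).
% leftCol: the 8 reconstructed pixels bordering the block on the left.
pred  = repmat(double(leftCol(:)), 1, 8);  % replicate each left pixel across its row
resid = double(block) - pred;              % residual that gets transformed/quantized
```

Note that the prediction is formed from reconstructed pixels, not original ones, so the decoder, which only ever has the reconstruction, can form exactly the same prediction; that is why the encoder in the questioner's diagram keeps a decoded copy inside the loop.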

I'm not sure whether you need to implement just motion compensation with P-frames, or whether you need B-frames as well. I'm going to assume you'll be needing both. As such, take a look at this diagram I pulled off of Wikipedia:

[Image: typical GOP structure showing I-, P- and B-frame prediction. Source: Wikipedia]

This is a very common sequence for encoding frames in your video. It follows the format of:

IBBPBBPBBI...

There is a time axis at the bottom that tells you the sequence in which frames get sent to the decoder once you encode them. I-frames need to be encoded first, followed by P-frames, and then B-frames. A typical sequence of frames encoded in between the I-frames follows the format that you see in the figure. The chunk of frames in between I-frames is what is known as a Group of Pictures (GOP). If you remember from our previous post, B-frames use information from ahead of and behind their current position. As such, to summarize the timeline, this is what is usually done on the encoder side:

  • The I-frame is encoded, and then is used to predict the first P-frame
  • The first I-frame and the first P-frame are then used to predict the first and second B-frames that sit in between these frames
  • The second P-frame is predicted using the first P-frame, and the third and fourth B-frames are created using information between the first P-frame and the second P-frame
  • Finally, the last frame in the GOP is an I-frame. This is encoded, and then information between the second P-frame and the second I-frame (the last frame) is used to generate the fifth and sixth B-frames

Therefore, what needs to happen is that, within each group, you send the I-frame first, then the P-frames, and then the B-frames after; the decoder has to wait for the anchor P-frames before the B-frames can be reconstructed (a small reordering sketch appears at the end of this section). However, this method of decoding is more robust because:

  • It minimizes the problem of possible uncovered areas.
  • P-frames and B-frames need less data than I-frames, so less data is transmitted.

However, B-frames will require more motion vectors, and so the bit rate will be somewhat higher.
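If it helps, here is a small, purely illustrative Matlab sketch (mine, not from any spec) of that display-order to coding-order reshuffle for the IBBPBBPBB pattern above:

```matlab
% Illustrative only: reshuffle one GOP from display order into coding
% (transmission) order. Each B-frame is held back until the later anchor
% frame it depends on has been coded.
dispOrder = {'I','B','B','P','B','B','P','B','B'};   % display order
codeOrder = {};
pending   = {};                                      % B-frames awaiting an anchor
for k = 1:numel(dispOrder)
    label = sprintf('%s%d', dispOrder{k}, k);
    if strcmp(dispOrder{k}, 'B')
        pending{end+1} = label;                      %#ok<SAGROW>
    else                                             % I- or P-frame: an anchor
        codeOrder{end+1} = label;                    %#ok<SAGROW>
        codeOrder = [codeOrder, pending];            % flush the waiting B-frames
        pending   = {};
    end
end
% codeOrder is now {I1, P4, B2, B3, P7, B5, B6}; B8 and B9 wait for the
% I-frame that opens the next GOP.
```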

Question #2

Honestly, what I have seen people do is compute a simple sum of squared differences (SSD) between one frame and another to compare similarity. You take the colour components (whether RGB, YUV, etc.) of each pixel from one frame at one position, subtract them from the colour components at the same spatial location in the other frame, square each difference and add them all together. You accumulate all of these differences for every location in your frame. The higher the value, the more dissimilar the two frames are.
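As a minimal Matlab sketch of that accumulation (`frameA` and `frameB` are placeholder names for two same-sized frames):

```matlab
% Sum of squared differences over a whole frame (grayscale or RGB, uint8).
d   = double(frameA) - double(frameB);     % per-pixel, per-channel differences
ssd = sum(d(:).^2);                        % higher value = more dissimilar
```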

Another well-known measure is called Structural Similarity (SSIM), where statistical measures such as the mean and variance are used to assess how similar two frames are.
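For what it's worth, if you have the Image Processing Toolbox (R2014a or newer), it ships an `ssim` function that computes this directly; `recon` and `orig` below are placeholder names for two same-sized grayscale frames:

```matlab
[score, simMap] = ssim(recon, orig);   % score near 1 means the frames are very similar
```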

There are a whole bunch of other video quality metrics in use, and there are advantages and disadvantages to any of them. Rather than telling you which one to use, I refer you to this Wikipedia article so you can decide for yourself depending on your application. This Wikipedia article describes a whole bunch of similarity and video quality metrics, and the list doesn't stop there. There is still on-going research on which numerical measures best capture the similarity and quality between two frames.

Question #3

When searching for where a block from an I-frame has moved in a P-frame, you need to restrict the search to a finite-sized window around the location of this I-frame block, because you don't want the encoder to search every location in the frame. That would simply be too computationally intensive and would thus make your encoder slow. I actually mentioned this in our previous post.

Using one pixel to search for another pixel in the next frame is a very bad idea because of the minuscule amount of information that a single pixel contains. The reason why you compare whole blocks when doing motion estimation is that blocks of pixels usually have a lot of internal variation that is unique to the block itself. If we can find this same variation in another area of your next frame, then this is a very good candidate that this group of pixels moved together to this new location. Remember, we're assuming that the frame rate of the video is high enough that most of the pixels in your frame either don't move at all, or move very slowly. Using blocks allows the matching to be somewhat more accurate.
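Here is one way the windowed search could look in Matlab; this is a plain exhaustive search with an SAD cost, written for clarity rather than speed, and the function name matches the hypothetical helper used in the Question #1 sketch:

```matlab
function mv = block_search(block, curFrame, r, c)
% Exhaustive block matching with a +/-7 pixel search window and SAD cost.
% block    : 8x8 reference block whose top-left corner was at (r, c)
% curFrame : frame to search in (same size as the reference frame)
% mv       : [dy dx] displacement minimizing the sum of absolute differences
    win = 7;                               % search range in pixels
    [rows, cols] = size(curFrame);
    n = size(block, 1);
    bestSAD = inf;
    mv = [0 0];
    for dy = -win : win
        for dx = -win : win
            rr = r + dy;  cc = c + dx;
            if rr < 1 || cc < 1 || rr+n-1 > rows || cc+n-1 > cols
                continue;                  % candidate block falls outside frame
            end
            cand = curFrame(rr:rr+n-1, cc:cc+n-1);
            sad  = sum(abs(double(block(:)) - double(cand(:))));
            if sad < bestSAD
                bestSAD = sad;
                mv = [dy dx];
            end
        end
    end
end
```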

Blocks are compared one at a time, and the way blocks are compared is by using one of those similarity measures that I talked about in the Wikipedia article I referenced. You are certainly correct that doing this for 30 frames would indeed be slow, but there are implementations that are highly optimized to do the encoding very fast. One good example is FFMPEG. In fact, I use FFMPEG at work all the time. FFMPEG is highly customizable, and you can create an encoder / decoder that takes advantage of the architecture of your system. I have it set up so that encoding / decoding uses all of the cores on my machine (8 in total).

This doesn't really address the block comparison itself, though. Actually, the H.264 standard has a bunch of prediction mechanisms in place so that you're not looking at all of the blocks in an I-frame to predict the next P-frame (or one P-frame to the next P-frame, etc.). This alludes to the different prediction modes in the Wikipedia article and in the paper that I referred you to. The encoder is intelligent enough to detect a pattern and generalize over an area of your image where it believes everything will exhibit the same amount of motion. It skips this area and moves on to the next.


This assignment (in my opinion) is way too broad. There are so many intricacies in motion prediction / compensation that there is a reason why most video engineers use available tools to do the work for us. Why reinvent the wheel when it has already been perfected, right?

I hope this has adequately answered your questions. I believe that I have given you more questions than answers really, but I hope that this is enough for you to delve into this topic further to achieve your overall goal.

Good luck!

  • Thanks for such an elaborate answer; this helps. So first things first, I am asked to perform intra prediction. From what I can tell, I need the original image and the reconstructed image, right? Reconstructing an image is easy, but what I don't understand is how to perform one of the nine intra coding modes (we have to implement only one of these, of our choice). I chose to do horizontal coding. What is the best way to find a match? I use sum of squared error and sum of absolute differences, right? – David Norman Aug 17 '14 at 06:19
  • OK, there is *inter*-coding and *intra*-coding. Inter-coding uses encoded blocks from previous or future frames to encode a particular frame (i.e. P- and B-frames). Intra-coding **only uses** information from the current frame to encode the frame (i.e. I-frames). Now, to address intra prediction, take a look at this paper for more details: http://ip.hhi.de/imagecom_G1/assets/pdfs/csvt_overview_0305.pdf (page 568 specifically). These slides are also great: http://courses.cs.washington.edu/courses/csep590a/07au/lectures/rahullarge.pdf - (will carry on in my next comment) – rayryeng Aug 17 '14 at 14:33
  • Also take a look at these slides by Iain Richardson - http://www.slideshare.net/vcodex/introduction-to-h264-advanced-video-compression - I have his book, and it's very good. In any case, you **don't** need the reconstruction image. What you do is decompose the frame into bigger blocks (64 x 64, or 128 x 128 or something). For each of these bigger blocks, you decompose these bigger blocks into 16 x 16 blocks for luma and 8 x 8 blocks for chroma. The first column of 16 x 16 (or 8 x 8) blocks in this bigger block are encoded normally.... (continuing on next comment) – rayryeng Aug 17 '14 at 14:38
  • Now, for the rest of the blocks, for each row you take the first encoded block from the first column, and you simply **copy** the encoded information over to the right all the way to the end of the row. This way, you're only encoding **a small portion** of this overall large block using the standard procedure, and you then infer the rest of the 16 x 16 or 8 x 8 blocks in this larger block. Now, the SSE or SAD are used when predicting motion vectors (a.k.a. inter-frame prediction), so you would then figure out where the blocks in the original frame have best moved to in the next frame – rayryeng Aug 17 '14 at 14:42
  • This prediction is useful because it avoids having to apply the standard encoding procedure for **all** of the blocks in your frame. You only have to do it for a good percentage of the frame but not the entire frame, and the rest of the blocks are predicted based on already encoded block information. This is one of the things that H.264 does to decrease encoding time. – rayryeng Aug 17 '14 at 14:44
  • I am starting to get a clearer picture. You make things sound easy, so I will try to implement this in Matlab and get back to you. I wish I could give you another upvote for the to-the-point explanation – David Norman Aug 17 '14 at 21:26
  • Ok, so this is what I have done so far. I am working on intra prediction. I took the very first frame and encoded it (divided it into 8x8 blocks > performed the DCT > quantized it > dequantized > performed the iDCT). Then I created a new frame where I took the first column of the first encoded 8x8 block and copied it across till the end of the block (performing horizontal coding). I did the same for the rest of the blocks, where I took the first column of each block and copied it across each row (a sketch of this loop appears after this comment thread). Before I go any further, do you reckon I performed horizontal coding the right way? – David Norman Aug 21 '14 at 08:17
  • @DavidNorman - I think so, but the reason I specified splitting up your frame into bigger blocks (like 128 x 128) and then doing the horizontal coding is that doing, for each megablock, what you did for the entire frame would be a better representation of the motion. By doing it over the entire frame, you would only be describing the motion along the outer edges, and you would be ignoring the motion that is inside. I figure that doing it on a megablock scale would be a better representation of the motion. – rayryeng Aug 21 '14 at 14:22
  • @DavidNorman - This detail is one of those things that I was never clear about in the H.264 spec. When doing horizontal prediction, I don't know if we should do it for the **entire** frame, or if we need to decompose our image into megablocks and apply the same algorithm. Because we aren't clear about it, I say that what you're doing now is completely acceptable. What does your prof say about it? – rayryeng Aug 21 '14 at 14:23
  • I haven't shown him anything yet; he doesn't know anything, and I don't know how he got his PhD. Anyway, how do I sample each and every pixel of the reconstructed frame? For example, if I split the frame into 20 8x8 blocks, does the intra prediction end up having a lot more blocks than 20? I am very confused as to how to sample the blocks; I thought only the outside blocks would be fine (NOTE: we are supposed to end up with 8x8 blocks to reconstruct the image) – David Norman Aug 21 '14 at 19:49
  • It says here 'To code a block reconstructed pixels spatially surrounding the block must be used as reference'. What is this supposed to mean? – David Norman Aug 21 '14 at 19:56
  • @DavidNorman - Each block is used to reconstruct 8 x 8 patches of pixels. It is within each block that the individual pixels are used to reconstruct your frame. The intraprediction should only have a subset of the blocks from the original image. The rest of the blocks are predicted using the horizontal scheme you're talking about. – rayryeng Aug 21 '14 at 20:02
  • That description is pretty much what we're doing now. When you reconstruct the predicted blocks, the blocks that were encoded without prediction are decoded first. After that, the predicted blocks are reconstructed by simply copying over the blocks that were decoded without prediction. I'm going back to work. Hit me up with your status later. FWIW, I think you should hire me on Codementor if you want more dedicated help :) Only so much I can do in a comments block. – rayryeng Aug 21 '14 at 20:04
  • This is harder than rocket science. – David Norman Aug 21 '14 at 20:07
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/59765/discussion-between-david-norman-and-rayryeng). – David Norman Aug 21 '14 at 20:07
  • Did you get around going through the diagrams? I still don't understand how to go about intra coding – David Norman Aug 22 '14 at 02:59
  • @DavidNorman - Not yet. I'm about to go to bed so I'll pick it up in the morning. Hang in there! – rayryeng Aug 22 '14 at 06:04
  • I thought you weren't going to help anymore. Thanks for your help – David Norman Aug 22 '14 at 06:08
  • @DavidNorman - Aahaha any normal person probably wouldn't. Glad I'm not normal. – rayryeng Aug 22 '14 at 06:11
  • I have edited my question. Tried to be a little more specific – David Norman Aug 22 '14 at 06:54
  • @DavidNorman - Hi David. Been busy house hunting. My apologies. How did you get it to work!? Can you explain? – rayryeng Aug 24 '14 at 22:22
  • That's alright no need to apologize. I will update my question or post an answer all together very soon – David Norman Aug 25 '14 at 00:45
  • Do you know about the quiver function in Matlab? – David Norman Aug 31 '14 at 06:48
  • I have posted a [question](http://stackoverflow.com/questions/25589196/how-does-quiver-function-work-in-matlab) about using quiver. I'm sure for most people it is straightforward, but I had a few queries. – David Norman Aug 31 '14 at 20:27
  • @DavidNorman - Luis Mendo answered your question and you accepted the answer. Do you have any more questions about `quiver`? His definition is spot on – rayryeng Aug 31 '14 at 23:02
  • I saw the post after I commented on here. Thanks – David Norman Aug 31 '14 at 23:04
  • Please help [question](http://stackoverflow.com/questions/25695974/psnr-for-intra-predicted-frame-vs-encoded-frame) – David Norman Sep 06 '14 at 06:55
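For anyone following this thread later, here is a minimal Matlab sketch of the per-block encode/reconstruct round trip David describes in the comments above. The uniform quantization step `Q` and the file name are placeholders of mine, not values from the thread, and `dct2`/`idct2` require the Image Processing Toolbox:

```matlab
% Per-block round trip: DCT -> quantize -> dequantize -> inverse DCT.
% Assumes a grayscale frame whose dimensions divide evenly by 8.
Q = 16;                                      % illustrative quantization step
frame = double(imread('frame001.png'));      % placeholder file name
recon = zeros(size(frame));
for r = 1 : 8 : size(frame, 1)
    for c = 1 : 8 : size(frame, 2)
        blk    = frame(r:r+7, c:c+7);
        coeffs = round(dct2(blk) / Q);       % transform + quantize
        recon(r:r+7, c:c+7) = idct2(coeffs * Q);  % dequantize + inverse transform
    end
end
recon = uint8(min(max(recon, 0), 255));      % clip back to the 8-bit pixel range
```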

Question 1: Is the intra-predicted image only for the first image (I-frame) of the sequence, or does it need to be computed for all 30 frames?

I know that there are five intra coding modes: horizontal, vertical, DC, left-up to right-down, and right-up to left-down.

Answer: Intra prediction need not be used for all the frames.

Question 2: How do I actually go about comparing the reconstructed frame and the anchor frame (the original current frame)?

Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?

Answer: We need to use a block matching algorithm to find the motion vector, so a search area is required. Normally, the search area should be larger than the block size; the larger the search area, the more the computation and the higher the accuracy.

  • @ElGavilan, you do realize that these are not new questions, right? Jeffin copied the questions listed in the original question and then provided answers to them. Obviously a bit of text formatting in the answer might help make this more obvious... – DB5 Dec 30 '14 at 20:57
  • This question was already solved with the OP in the summer. I went into a chat room with the OP and we addressed everything he needed to know in order to help solve his problem. In fact, I am the person who posted the answer on this post. I don't see the usefulness of your answer here. Also, no offence, but I believe my answers go into more elaborate detail than what you have provided. – rayryeng Dec 31 '14 at 00:20