
I'm trying to decode an MP4 video using the Windows Media Foundation classes and convert the frames into 2D textures that can be used by a DirectX shader for rendering. I've been able to read the source stream using MFCreateSourceReaderFromURL and to read the media type of the stream, which has major type MFMediaType_Video and minor type MFVideoFormat_H264 as expected.

I now need to convert this format into an RGB format that can be used to initialise a D3D11_TEXTURE2D resource and resource view, which can then be passed to an HLSL pixel shader for sampling. I've tried using the IMFTransform class to do the conversion for me, but when I try to set the output type on the transform to any MFVideoFormat_RGB variant I get an error. I've also tried setting a new output type on the source reader and just sampling that, hoping to get a sample in the correct format, but again I've had no luck.

So my questions would be:

  • Is this type of conversion possible?

  • Can this be done through the IMFTransform/SourceReader classes as I've tried above (do I just need to tweak the code), or do I need to do this type of conversion manually?

  • Is this the best way to go about feeding video texture data into a shader for sampling, or is there an easier alternative that I've not thought about?

The OS being used is Windows 7, so I can't use the SourceReaderEx or ID3D11VideoDevice interfaces because, as far as I'm aware, those are only available on Windows 8.

Any help/pointers in the right direction would be greatly appreciated; I can also provide some source code if necessary.

TheRarebit

3 Answers


Is this type of conversion possible?

Yes, it is possible. The stock H.264 Video Decoder MFT is "Direct3D aware", which means it can decode video into Direct3D 9 surfaces or Direct3D 11 textures, leveraging DXVA. If hardware capabilities are insufficient, there is a software fallback mode too. You are interested in getting the output delivered right into a texture for performance reasons (otherwise you would have to upload the data yourself, spending CPU and video resources on that).
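As a minimal sketch of that Direct3D-aware path on Windows 7 (assuming an already-created Direct3D 9 device in pd3dDevice, which is a placeholder; error handling abbreviated), you hand the Source Reader a device manager and the decoder can then use DXVA:

    #include <mfapi.h>
    #include <mfreadwrite.h>
    #include <dxva2api.h>

    // Wrap the existing D3D9 device in a device manager the decoder can share.
    UINT resetToken = 0;
    IDirect3DDeviceManager9* pManager = nullptr;
    HRESULT hr = DXVA2CreateDirect3DDeviceManager9(&resetToken, &pManager);

    if (SUCCEEDED(hr))
        hr = pManager->ResetDevice(pd3dDevice, resetToken);

    // Hand the manager to the Source Reader so the H.264 decoder MFT can
    // decode into Direct3D 9 surfaces (DXVA) instead of system memory.
    IMFAttributes* pAttributes = nullptr;
    if (SUCCEEDED(hr))
        hr = MFCreateAttributes(&pAttributes, 1);
    if (SUCCEEDED(hr))
        hr = pAttributes->SetUnknown(MF_SOURCE_READER_D3D_MANAGER, pManager);

    IMFSourceReader* pReader = nullptr;
    if (SUCCEEDED(hr))
        hr = MFCreateSourceReaderFromURL(L"video.mp4", pAttributes, &pReader);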

Can this be done through the IMFTransform/SourceReader classes as I've tried above (do I just need to tweak the code), or do I need to do this type of conversion manually?

IMFTransform is an abstract interface. It is implemented by the H.264 decoder (as well as by other MFTs), and you can use it directly, or you can use the higher-level Source Reader API to have it manage reading the video from the file and decoding it with this MFT.

That is, the MFT and the Source Reader are not mutually exclusive alternatives but rather lower- and higher-level APIs. The MFT interface is offered by the decoder, and you are responsible for feeding H.264 data in and draining the decoded output. The Source Reader manages the same MFT and adds the file-reading capability.

The Source Reader itself is available on Windows 7, BTW (even on Vista, though possibly with a more limited feature set than on newer OSes).
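A minimal sketch of that Source Reader route (the file name is a placeholder; on Windows 7 the MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING attribute is what lets the reader insert the extra YUV-to-RGB conversion the question asks about):

    #include <mfapi.h>
    #include <mfidl.h>
    #include <mfreadwrite.h>

    // Allow the reader to insert converters (e.g. YUV -> RGB) besides the decoder.
    IMFAttributes* pAttributes = nullptr;
    HRESULT hr = MFCreateAttributes(&pAttributes, 1);
    if (SUCCEEDED(hr))
        hr = pAttributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE);

    IMFSourceReader* pReader = nullptr;
    if (SUCCEEDED(hr))
        hr = MFCreateSourceReaderFromURL(L"video.mp4", pAttributes, &pReader);

    // Request uncompressed RGB32 output; the reader loads the H.264 decoder
    // MFT (and a color converter) internally.
    IMFMediaType* pType = nullptr;
    if (SUCCEEDED(hr))
        hr = MFCreateMediaType(&pType);
    if (SUCCEEDED(hr))
        hr = pType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    if (SUCCEEDED(hr))
        hr = pType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    if (SUCCEEDED(hr))
        hr = pReader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, NULL, pType);

    // Every successful ReadSample now delivers one decoded RGB32 frame.
    IMFSample* pSample = nullptr;
    DWORD streamIndex = 0, flags = 0;
    LONGLONG timestamp = 0;
    if (SUCCEEDED(hr))
        hr = pReader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0,
                                 &streamIndex, &flags, &timestamp, &pSample);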

Roman R.
  • Thanks for the feedback and cheers for clearing those questions up for me. I'm guessing the SourceReader's job is to provide samples that can be fed into an MFT, right? I'm having trouble with the ProcessOutput stage of the transform despite ReadSample from the SourceReader returning S_OK and ProcessInput on the MFT returning S_OK – TheRarebit May 31 '16 at 08:41
  • Basically you can have the decoding MFT embedded into the source reader if you provide the respective source reader configuration. In this case the MFT is managed by the source reader and you get already-decoded samples. Otherwise you might prefer the original samples, and then you are on your own with the MFT, doing ProcessInput, ProcessOutput and the rest of the calls. In the former case it is of course easier to troubleshoot problems, because you can see that reading the original samples is okay. – Roman R. Jun 01 '16 at 09:43

I see a mistake in your understanding of Media Foundation. You want to get an image in an RGB format from MFVideoFormat_H264, but you are not using an H264 decoder. You wrote "I've tried using the IMFTransform class", but IMFTransform is not a class: it is an interface implemented by Transform COM objects. You must create the Media Foundation H264 decoder COM object; the CLSID for the Microsoft software H264 decoder is CLSID_CMSH264DecoderMFT (see the sketch after the list below). However, that decoder can only produce output images in the following formats:

  • MFVideoFormat_I420

  • MFVideoFormat_IYUV

  • MFVideoFormat_NV12

  • MFVideoFormat_YUY2

  • MFVideoFormat_YV12
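Creating that decoder object might look like this minimal sketch (assuming COM and Media Foundation are already initialized; CLSID_CMSH264DecoderMFT is declared in wmcodecdsp.h):

    #include <mftransform.h>
    #include <wmcodecdsp.h>   // CLSID_CMSH264DecoderMFT

    // Instantiate the software H264 decoder MFT as a COM object.
    IMFTransform* pDecoder = nullptr;
    HRESULT hr = CoCreateInstance(CLSID_CMSH264DecoderMFT, nullptr,
                                  CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&pDecoder));
    // On success: set the H264 input type first, then pick one of the YUV
    // output types listed above via GetOutputAvailableType/SetOutputType.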

You can create a D3D11_TEXTURE2D from one of those formats, or you can do something like this, taken from my CaptureManager SDK project:

    // Excerpt: lresult, lVideoMediaType, lSubType, lWidth, lHight, lNumerator,
    // lDenominator, lBitRate and aPtrPtrInputMediaType are declared in the
    // surrounding function. A truthy lresult means failure.

    // Create the Color Converter DSP, which converts YUV output to RGB32.
    CComPtrCustom<IMFTransform> lColorConvert;

    if (!Result(lColorConvert.CoCreateInstance(__uuidof(CColorConvertDMO))))
    {
        // Feed the converter the decoder's output media type as its input type.
        lresult = MediaFoundationManager::setInputType(
            lColorConvert,
            0,
            lVideoMediaType,
            0);

        if (lresult)
        {
            break;
        }

        // Enumerate the converter's available output types until RGB32 is found.
        DWORD lTypeIndex = 0;

        while (!lresult)
        {
            CComPtrCustom<IMFMediaType> lOutputType;

            lresult = lColorConvert->GetOutputAvailableType(0, lTypeIndex++, &lOutputType);

            if (!lresult)
            {
                lresult = MediaFoundationManager::getGUID(
                    lOutputType,
                    MF_MT_SUBTYPE,
                    lSubType);

                if (lresult)
                {
                    break;
                }

                if (lSubType == MFVideoFormat_RGB32)
                {
                    // Compute the average bit rate from stride, height and frame rate.
                    LONG lstride = 0;

                    MediaFoundationManager::getStrideForBitmapInfoHeader(
                        lSubType,
                        lWidth,
                        lstride);

                    // A negative stride means a top-down image; only the magnitude matters here.
                    if (lstride < 0)
                        lstride = -lstride;

                    lBitRate = (lHight * (UINT32)lstride * 8 * lNumerator) / lDenominator;

                    lresult = MediaFoundationManager::setUINT32(
                        lOutputType,
                        MF_MT_AVG_BITRATE,
                        lBitRate);

                    if (lresult)
                    {
                        break;
                    }

                    // Copy the frame rate from the input type to the output type.
                    PROPVARIANT lVarItem;

                    lresult = MediaFoundationManager::getItem(
                        *aPtrPtrInputMediaType,
                        MF_MT_FRAME_RATE,
                        lVarItem);

                    if (lresult)
                    {
                        break;
                    }

                    lresult = MediaFoundationManager::setItem(
                        lOutputType,
                        MF_MT_FRAME_RATE,
                        lVarItem);

                    if (lresult)
                    {
                        break;
                    }

                    // Replace the caller's media type with the finished RGB32 output type.
                    (*aPtrPtrInputMediaType)->Release();

                    *aPtrPtrInputMediaType = lOutputType.detach();

                    break;
                }
            }
        }
    }

You can insert the Color Converter DMO to convert from the output format of the H264 decoder into the one you need.

Also, you can view the code at this link: videoInput. That code takes live video from a web camera and decodes it into RGB. If you replace the web camera source with an MP4 video file source, you will get a solution close to what you need.
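For the original goal of sampling the frame in a pixel shader, a minimal sketch of turning one decoded RGB32 frame into a D3D11_TEXTURE2D and shader resource view (pDevice, width and height are placeholders; MFVideoFormat_RGB32 is BGRX byte order):

    #include <d3d11.h>
    #include <mfidl.h>

    // Get the frame bytes out of the decoded sample.
    IMFMediaBuffer* pBuffer = nullptr;
    HRESULT hr = pSample->ConvertToContiguousBuffer(&pBuffer);

    BYTE* pData = nullptr;
    DWORD maxLen = 0, curLen = 0;
    if (SUCCEEDED(hr))
        hr = pBuffer->Lock(&pData, &maxLen, &curLen);

    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = width;                          // frame width from the media type
    desc.Height = height;                        // frame height from the media type
    desc.MipLevels = 1;
    desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_B8G8R8X8_UNORM;    // RGB32: BGRX byte order
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = pData;
    init.SysMemPitch = width * 4;                // query MF_MT_DEFAULT_STRIDE in real code

    ID3D11Texture2D* pTexture = nullptr;
    if (SUCCEEDED(hr))
        hr = pDevice->CreateTexture2D(&desc, &init, &pTexture);

    ID3D11ShaderResourceView* pSRV = nullptr;
    if (SUCCEEDED(hr))
        hr = pDevice->CreateShaderResourceView(pTexture, nullptr, &pSRV);

    pBuffer->Unlock();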

Regards

Evgeny Pereguda
  • That's great, thanks very much for the feedback I really appreciate it. I've followed the first steps and gotten the Media Foundation H264 decoder via CoCreateInstance and I've managed to set a valid output format as you've advised. The issue I'm getting at the moment is ProcessOutput always returns MF_E_TRANSFORM_NEED_MORE_INPUT. I keep calling ProcessInput until it returns MF_E_NOTACCEPTING which from what I've read means the output stage is ready so I can call ProcessOutput? But still no luck. Any advice on that? I'll also take a look at that sample you provided as well, cheers. – TheRarebit May 27 '16 at 10:58
  • Hi, the ProcessOutput method of IMFTransform is called with a pointer to an MFT_OUTPUT_DATA_BUFFER struct. In my code it looks like this: MFT_OUTPUT_DATA_BUFFER loutputDataBuffer; initOutputDataBuffer(lTransform, loutputDataBuffer); DWORD lprocessOutputStatus = 0; lresult = lTransform->ProcessOutput(0, 1, &loutputDataBuffer, &lprocessOutputStatus); You must define the initOutputDataBuffer method yourself. – Evgeny Pereguda May 28 '16 at 11:08
  • Try videoInput. It has an example of rendering live video from a web camera via OpenGL; you can modify it for DirectX. At the link [CreateObjectFromURL](https://msdn.microsoft.com/en-us/library/windows/desktop/ms702279(v=vs.85).aspx) you will find code for creating a MediaSource for a video file; it declares the signature CreateMediaSource. In [videoInput](http://www.codeproject.com/Articles/776058/Capturing-Live-video-from-Web-camera-on-Windows-an), in the MediaFoundation file, there is a method getSorceBySymbolicLink. Both functions have similar signatures; try replacing one with the other and it will work with a video file. – Evgeny Pereguda May 28 '16 at 11:18
  • Thanks for the continued feedback, with regards to your initOutputDataBuffer function you mention, I understand that the sample that is used as the ProcessOutput result needs to be allocated and set up before ProcessOutput is called and I'm guessing this is where my code is failing. My question now would be how do I know the size of the buffer I need to allocate on the output sample and what other properties need setting? I'm guessing that after decoding the input sample may be larger/smaller than the original sample? – TheRarebit May 31 '16 at 08:37
  • Hi, after decoding the output sample has a fixed size according to the format: RGB24, RGB32, NV12, YUY2 or another. There is a function, MFCalculateImageSize, which takes the SubType, Width and Height and computes SizeImage, the byte size of the specific uncompressed image (see the sketch after these comments). – Evgeny Pereguda May 31 '16 at 23:02
  • That's awesome, that solved the issue and now I'm getting the first frame of my movie rendering as expected. Thanks so much for the help. – TheRarebit Jun 01 '16 at 07:31
  • Just one other quick question: in the example in your answer, when you're calculating the bit rate you use lNumerator and lDenominator; what do they refer to? I was assuming the frame rate? – TheRarebit Jun 01 '16 at 09:59
  • MF cannot work with floating-point values here. The frame rate must be set as a ratio of two integer values, and it is lNumerator / lDenominator. To read them easily from an IMFMediaType you can use [MFGetAttributeRatio](https://msdn.microsoft.com/en-us/library/windows/desktop/ms695324(v=vs.85).aspx) with the GUID MF_MT_FRAME_RATE – Evgeny Pereguda Jun 01 '16 at 10:39
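To make the two helpers mentioned in the comments above concrete, a minimal sketch (pType, width and height are placeholders):

    #include <mfapi.h>

    // The frame rate is stored on the media type as a ratio of two integers.
    UINT32 numerator = 0, denominator = 0;
    HRESULT hr = MFGetAttributeRatio(pType, MF_MT_FRAME_RATE, &numerator, &denominator);

    // Byte size of one uncompressed frame for a given subtype.
    UINT32 imageSize = 0;
    if (SUCCEEDED(hr))
        hr = MFCalculateImageSize(MFVideoFormat_RGB32, width, height, &imageSize);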

Decoding can be executed with the following code:

    // Prepare an output buffer (see initOutputDataBuffer below), then
    // ask the transform for a decoded sample.
    MFT_OUTPUT_DATA_BUFFER loutputDataBuffer;

    initOutputDataBuffer(
        lTransform,
        loutputDataBuffer);

    DWORD lprocessOutputStatus = 0;

    lresult = lTransform->ProcessOutput(
        0,
        1,
        &loutputDataBuffer,
        &lprocessOutputStatus);

    if ((HRESULT)lresult == E_FAIL)
    {
        break;
    }

The initOutputDataBuffer function allocates the needed memory. An example of that function is presented here:

    Result initOutputDataBuffer(IMFTransform* aPtrTransform,
                                MFT_OUTPUT_DATA_BUFFER& aRefOutputBuffer)
    {
        Result lresult;

        MFT_OUTPUT_STREAM_INFO loutputStreamInfo;

        DWORD loutputStreamId = 0;

        CComPtrCustom<IMFSample> lOutputSample;

        CComPtrCustom<IMFMediaBuffer> lMediaBuffer;

        do
        {
            if (aPtrTransform == nullptr)
            {
                lresult = E_POINTER;

                break;
            }

            ZeroMemory(&loutputStreamInfo, sizeof(loutputStreamInfo));

            ZeroMemory(&aRefOutputBuffer, sizeof(aRefOutputBuffer));

            // Ask the MFT for its output requirements; cbSize is the
            // minimum byte size of an output buffer.
            lresult = aPtrTransform->GetOutputStreamInfo(loutputStreamId, &loutputStreamInfo);

            if (lresult)
            {
                break;
            }

            // If the MFT neither provides nor can provide its own samples,
            // the caller must allocate the sample and its memory buffer.
            if ((loutputStreamInfo.dwFlags & MFT_OUTPUT_STREAM_PROVIDES_SAMPLES) == 0 &&
                (loutputStreamInfo.dwFlags & MFT_OUTPUT_STREAM_CAN_PROVIDE_SAMPLES) == 0)
            {
                lresult = MFCreateSample(&lOutputSample);

                if (lresult)
                {
                    break;
                }

                lresult = MFCreateMemoryBuffer(loutputStreamInfo.cbSize, &lMediaBuffer);

                if (lresult)
                {
                    break;
                }

                lresult = lOutputSample->AddBuffer(lMediaBuffer);

                if (lresult)
                {
                    break;
                }

                aRefOutputBuffer.pSample = lOutputSample.Detach();
            }
            else
            {
                lresult = S_OK;
            }

            aRefOutputBuffer.dwStreamID = loutputStreamId;
        } while (false);

        return lresult;
    }

This code gets information about the output samples via the GetOutputStreamInfo method of IMFTransform. MFT_OUTPUT_STREAM_INFO contains the needed memory size for the output media sample in cbSize. The code allocates a memory buffer of that size, adds it to a media sample, and attaches the sample to the MFT_OUTPUT_DATA_BUFFER.
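Putting these pieces together, the usual drive loop around ProcessInput/ProcessOutput might look like this minimal sketch (consumeOutputSample is a hypothetical callback; initOutputDataBuffer is the function above):

    // Feed one compressed sample, then drain decoded samples until the MFT
    // asks for more input. MF_E_TRANSFORM_NEED_MORE_INPUT is the normal
    // signal to go back and feed the next input sample.
    HRESULT DecodeOneSample(IMFTransform* pTransform, IMFSample* pInput)
    {
        HRESULT hr = pTransform->ProcessInput(0, pInput, 0);

        if (FAILED(hr))
            return hr;   // MF_E_NOTACCEPTING: pending output must be drained first

        for (;;)
        {
            MFT_OUTPUT_DATA_BUFFER outputBuffer;
            initOutputDataBuffer(pTransform, outputBuffer);

            DWORD status = 0;
            hr = pTransform->ProcessOutput(0, 1, &outputBuffer, &status);

            if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT)
                return S_OK;                     // feed the next input sample

            if (FAILED(hr))
                return hr;

            // outputBuffer.pSample now holds one decoded frame.
            consumeOutputSample(outputBuffer.pSample);

            if (outputBuffer.pSample)
                outputBuffer.pSample->Release();
            if (outputBuffer.pEvents)
                outputBuffer.pEvents->Release();
        }
    }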

So, as you can see, writing code for encoding and decoding video by calling the Media Foundation functions directly can be difficult and requires significant knowledge of the API. From the description of your task I see that you only need to decode video and present it. I can advise you to try the Media Foundation Session functionality: it is developed by Microsoft's engineers, already includes the algorithms for selecting the needed decoders, and is optimized.

In the videoInput project, a Media Foundation Session is used to find the suitable decoder for a Media Source created for a web camera and to grab the frames in an uncompressed format. It already does the needed processing; you need only replace the web camera Media Source with a Media Source for a video file.

This can be much easier than writing code that calls IMFTransform directly for decoding, and it sidesteps many problems, for example stabilizing the frame rate. If your code renders each image immediately after decoding and then decodes the next frame, it can render a one-minute video clip in a couple of seconds; conversely, if rendering the video and other content takes more than one frame duration, the video plays in "slow motion" and rendering a one-minute clip can take 2, 3 or 5 minutes. I do not know which project you need video decoding for, but you should have serious reasons for writing code that calls the Media Foundation functions and interfaces directly.
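A heavily condensed sketch of that Media Session approach, building a playback topology for a video file and letting the session insert the right decoder (error handling and audio streams omitted; PlayFile, url and hwndVideo are placeholders):

    #include <mfapi.h>
    #include <mfidl.h>
    #include <evr.h>

    void PlayFile(PCWSTR url, HWND hwndVideo)
    {
        // Create the media source for the file (this is the part that
        // replaces the web camera source in videoInput).
        IMFSourceResolver* pResolver = nullptr;
        MFCreateSourceResolver(&pResolver);

        MF_OBJECT_TYPE objectType = MF_OBJECT_INVALID;
        IUnknown* pSourceUnk = nullptr;
        pResolver->CreateObjectFromURL(url, MF_RESOLUTION_MEDIASOURCE,
                                       nullptr, &objectType, &pSourceUnk);

        IMFMediaSource* pSource = nullptr;
        pSourceUnk->QueryInterface(IID_PPV_ARGS(&pSource));

        IMFMediaSession* pSession = nullptr;
        MFCreateMediaSession(nullptr, &pSession);

        IMFTopology* pTopology = nullptr;
        MFCreateTopology(&pTopology);

        IMFPresentationDescriptor* pPD = nullptr;
        pSource->CreatePresentationDescriptor(&pPD);

        // Source node for the first (assumed video) stream.
        BOOL selected = FALSE;
        IMFStreamDescriptor* pSD = nullptr;
        pPD->GetStreamDescriptorByIndex(0, &selected, &pSD);

        IMFTopologyNode* pSourceNode = nullptr;
        MFCreateTopologyNode(MF_TOPOLOGY_SOURCESTREAM_NODE, &pSourceNode);
        pSourceNode->SetUnknown(MF_TOPONODE_SOURCE, pSource);
        pSourceNode->SetUnknown(MF_TOPONODE_PRESENTATION_DESCRIPTOR, pPD);
        pSourceNode->SetUnknown(MF_TOPONODE_STREAM_DESCRIPTOR, pSD);
        pTopology->AddNode(pSourceNode);

        // Output node: the Enhanced Video Renderer drawing into the window.
        IMFActivate* pRendererActivate = nullptr;
        MFCreateVideoRendererActivate(hwndVideo, &pRendererActivate);

        IMFTopologyNode* pOutputNode = nullptr;
        MFCreateTopologyNode(MF_TOPOLOGY_OUTPUT_NODE, &pOutputNode);
        pOutputNode->SetObject(pRendererActivate);
        pTopology->AddNode(pOutputNode);

        pSourceNode->ConnectOutput(0, pOutputNode, 0);

        // The session resolves the partial topology (inserting the H264
        // decoder and any converters) and handles presentation timing.
        pSession->SetTopology(0, pTopology);

        PROPVARIANT varStart;
        PropVariantInit(&varStart);
        pSession->Start(&GUID_NULL, &varStart);
    }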

Regards.

Evgeny Pereguda