
I am trying to extract all the images in a PDF and then convert them into DIB format. First part is easy. I extract all the contents in the PDF, then iterate through them and whenever I find a PDEImage, I put them in an array.

But I am clueless about how to go about the second part. Looks like all the AVConversion methods allow you to convert a whole page of a PDF, not just images, into other formats.

Is there any way I can accomplish this task? Thanks in advance!

EDIT: Further elaborating the problem.

I am writing an Adobe Acrobat Plug-in using Visual C++ with .NET Framework 4.

The purpose of the plug-in is to (among other things) extract image data from a PDF file, then convert that data to DIBs. I need DIBs because I then pass them to another library, which does some image correction work on them.

Now my problem is with converting the image data in a PDF to DIBs. The image data in a PDF is held in an object called PDEImage (Ref Link), which apparently contains all the color data of the image. I'm using the following code to extract the image data bits and use them with CreateCompatibleBitmap() and SetBitmapBits() to obtain an HBITMAP handle. Then I pass that, along with the other necessary parameters, to GetDIBits() to obtain a DIB in the form of a byte array, as described on MSDN.

void GetDIBImage(PDEElement element)
{
    //Obtaining a PDEImage
    PDEImage image;
    memset(&image, 0, sizeof(PDEImage));
    image = (PDEImage)element;

    //Obtaining attributes (such as width, height)
    //of the image for later use
    PDEImageAttrs attrs;
    memset(&attrs, 0, sizeof(attrs));
    PDEImageGetAttrs(image, &attrs, sizeof(attrs));

    //Obtaining image data from PDEImage to a byte array
    ASInt32 len = PDEImageGetDataLen(image);
    byte *data = (byte *)malloc(len);
    PDEImageGetData(image, 0, data);

    //Creating a DDB using said data
    HDC hdc = CreateCompatibleDC(NULL); 
    HBITMAP hBmp = CreateCompatibleBitmap(hdc, attrs.width, attrs.height);  
    LONG bitsSet = SetBitmapBits(hBmp, len, data);  //Here bitsSet gets a value of 59000 which is close to the image's actual size

    //Buffer which GetDIBits() will fill with DIB data
    unsigned char* buff = new unsigned char[len];

    //BITMAPINFO structure to be passed to GetDIBits()
    BITMAPINFO bmpInfo;
    memset(&bmpInfo, 0, sizeof(bmpInfo));

    bmpInfo.bmiHeader.biSize = sizeof(BITMAPINFOHEADER);
    bmpInfo.bmiHeader.biWidth = (LONG)attrs.width;
    bmpInfo.bmiHeader.biHeight = (LONG)attrs.height;
    bmpInfo.bmiHeader.biPlanes = 1;
    bmpInfo.bmiHeader.biBitCount = 8;
    bmpInfo.bmiHeader.biCompression = BI_RGB;
    bmpInfo.bmiHeader.biSizeImage = ((((bmpInfo.bmiHeader.biWidth * bmpInfo.bmiHeader.biBitCount) + 31) & ~31) >> 3) * bmpInfo.bmiHeader.biHeight;  
    bmpInfo.bmiHeader.biXPelsPerMeter = 0;
    bmpInfo.bmiHeader.biYPelsPerMeter = 0;
    bmpInfo.bmiHeader.biClrUsed = 0;
    bmpInfo.bmiHeader.biClrImportant = 0;

    //Calling GetDIBits()
    //Here scanLines gets a value of 0, while buff receives no data.
    int scanLines = GetDIBits(hdc, hBmp, 0, attrs.height, &buff, &bmpInfo, DIB_RGB_COLORS);

    if(scanLines > 0)
    {
        MessageBox(NULL, L"SUCCESS", L"Message", MB_OK);
    }
    else
    {
        MessageBox(NULL, L"FAIL", L"Message", MB_OK);
    }
}

Here are my questions / concerns.

  1. Is it correct the way I'm using CreateCompatibleDC(), CreateCompatibleBitmap() and SetBitmapBits() functions? My thinking is that I use CreateCompatibleDC() to obtain current DC, then create a DDB using CreateCompatibleBitmap() and then set the actual data to the DDB using SetBitmapBits(). Is that correct?

  2. Is there a problem with the way I've created the BITMAPINFO structure? I am under the assumption that it needs to contain all the details regarding the format of the DIB I will eventually obtain.

  3. Why am I not getting the bitmap data as a DIB to the buff when I call GetDIBits()?

Sach

1 Answer

I do not know the library you use to access the inner structures of a PDF file, but the problem at hand has three distinct subproblems:

  1. Find all images in the PDF file
  2. Decode the images to their components
  3. Convert the decoded image to a DIB

Find all Images

Images can occur inside content streams or in streams attached to dictionaries. To find all images in content streams, you need to find all content streams in either Pages, XObjects or Patterns. Each of those can have a Resources -> XObject dictionary that references all XObjects (and an XObject can be an Image).

If you skip inline images, you might simply scan the PDF file: each dictionary that is of type XObject with subtype Image can be decoded.
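As a rough sketch of the traversal described above: the `XObject` struct below is a hypothetical in-memory model, not the type of any real PDF library, and `CollectImages` is an illustrative name. It walks the Resources -> XObject entries, follows Form XObjects recursively, and keeps a visited set so a self-referencing form cannot loop forever.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of an XObject; a real PDF library has its own types.
struct XObject {
    std::string subtype;                    // "Image" or "Form"
    std::vector<const XObject*> resources;  // Resources -> XObject entries (Form only)
};

// Appends every Image XObject reachable from `xobjects` to `images`,
// descending into Form XObjects and skipping objects already visited.
void CollectImages(const std::vector<const XObject*>& xobjects,
                   std::set<const XObject*>& visited,
                   std::vector<const XObject*>& images)
{
    for (const XObject* x : xobjects) {
        assert(x != nullptr);
        if (!visited.insert(x).second) {
            continue;  // already seen: avoids duplicates and cycles
        }
        if (x->subtype == "Image") {
            images.push_back(x);
        } else if (x->subtype == "Form") {
            CollectImages(x->resources, visited, images);
        }
    }
}
```

Calling it once per page with that page's XObject entries and a shared `visited` set would list each image only once, even when several pages reference it.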

Decode

All streams, whether inline in content streams or in separate objects in the PDF file, can be encoded and might need post-processing using the Decode arrays. There are several filters that you need to be able to handle for decoding. Flate decode (ZLIB), JPEG and CCITT (fax G3/G4) are probably the most used for images. Hopefully the PDF library you use will know how to decode the streams.
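A minimal sketch of how the filter name could be mapped to the decoder you need. The filter names and their inline-image abbreviations come from the PDF specification; the `StreamCodec` enum and `CodecForFilter` function are names made up for this illustration, and the actual decoding (zlib, a JPEG library, a fax decoder) is left out.

```cpp
#include <cassert>
#include <string>

// Illustrative categories for the decoder a stream needs.
enum class StreamCodec { None, Zlib, Jpeg, CcittFax, Unsupported };

// Maps a PDF filter name (without the leading '/') to a decoder category.
// Both the full names and the inline-image abbreviations are accepted.
StreamCodec CodecForFilter(const std::string& filter)
{
    assert(filter.find('/') == std::string::npos);  // pass the bare name
    if (filter.empty()) return StreamCodec::None;   // stream is not encoded
    if (filter == "FlateDecode" || filter == "Fl") return StreamCodec::Zlib;
    if (filter == "DCTDecode" || filter == "DCT") return StreamCodec::Jpeg;
    if (filter == "CCITTFaxDecode" || filter == "CCF") return StreamCodec::CcittFax;
    return StreamCodec::Unsupported;                // e.g. LZWDecode, JBIG2Decode
}
```

A stream can also carry an array of filters applied in sequence; this sketch only covers the single-filter case.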

Next there are Decode arrays (a bit rare) where each color component can be scaled from an input value to an output value. This is a linear transformation.
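The linear transformation can be sketched as follows. `ApplyDecode` is an illustrative name; it assumes you already have the raw sample value and one `[dMin dMax]` pair from the Decode array for that color component, and it interpolates over the sample range `0 .. 2^bitsPerComponent - 1`.

```cpp
#include <cassert>
#include <cstdint>

// Maps a raw sample to its decoded value using one [dMin, dMax] pair
// of a Decode array: sample 0 maps to dMin, the maximum sample to dMax.
double ApplyDecode(std::uint32_t sample, int bitsPerComponent,
                   double dMin, double dMax)
{
    assert(bitsPerComponent >= 1 && bitsPerComponent <= 16);
    const std::uint32_t maxSample = (1u << bitsPerComponent) - 1u;
    assert(sample <= maxSample);
    return dMin + static_cast<double>(sample) * (dMax - dMin)
                      / static_cast<double>(maxSample);
}
```

With the common pair `[0 1]` an 8-bit sample of 255 decodes to 1.0; with the inverted pair `[1 0]` the same sample decodes to 0.0, which is how a Decode array flips an image to its negative.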

To DIB

Next in line is the conversion of the decoded image to a DIB. This means you need to convert the color components to something Windows can 'get' (e.g. palette, grayscale (a special palette) or RGB). PDF supports a very large variety of color spaces, and converting them to RGB is no easy task. Your best hope here is that the PDFs you need to process only use a select subset (like RGB and palette). A DIB can then be created by filling in the bitmap header (BITMAPINFO), calling the DIB creation function CreateDIBSection, and then processing the DIB the way your application needs.
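A minimal sketch of the pixel-layout part of that step, kept free of Windows headers so it stays portable. It assumes the decoded image is tightly packed 8-bit RGB stored top row first (whether your PDF library delivers the data in that form is an assumption to verify); `DibStride` and `RgbToDibPixels` are illustrative names. A 24-bit bottom-up DIB wants BGR byte order, rows padded to a multiple of 4 bytes, and the last image row stored first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes per DIB scan line: rows are padded to a multiple of 4 bytes.
std::size_t DibStride(std::size_t width, std::size_t bitsPerPixel)
{
    return ((width * bitsPerPixel + 31) / 32) * 4;
}

// Converts packed, top-down 8-bit RGB into the pixel layout of a
// bottom-up 24-bit DIB (BGR order, padded rows, bottom row first).
std::vector<std::uint8_t> RgbToDibPixels(const std::vector<std::uint8_t>& rgb,
                                         std::size_t width, std::size_t height)
{
    assert(rgb.size() == width * height * 3);
    const std::size_t stride = DibStride(width, 24);
    std::vector<std::uint8_t> dib(stride * height, 0);  // padding stays zero
    for (std::size_t y = 0; y < height; ++y) {
        const std::uint8_t* src = rgb.data() + y * width * 3;
        std::uint8_t* dst = dib.data() + (height - 1 - y) * stride;
        for (std::size_t x = 0; x < width; ++x) {
            dst[x * 3 + 0] = src[x * 3 + 2];  // blue
            dst[x * 3 + 1] = src[x * 3 + 1];  // green
            dst[x * 3 + 2] = src[x * 3 + 0];  // red
        }
    }
    return dib;
}
```

The resulting bytes match a BITMAPINFOHEADER with biBitCount = 24, biCompression = BI_RGB and a positive biHeight, and can be copied into the memory returned by CreateDIBSection. When filling an existing buffer through GetDIBits() instead, the buffer pointer itself is passed (buff, not &buff), and an 8-bit DIB needs room for a 256-entry color table after the header.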

Epilogue

All in all, being able to process all PDF files and find all images is quite a daunting task. If you control the source of the PDFs and you know they are always in DeviceRGB format, always JPEG, and never inlined into the content stream, it is doable.

Ritsaert Hornstra
  • 1. Find all images: I want to convert PDEImage objects, which are XObjects, right? I loop through the contents of each PDF page and acquire them, which I can do fine. 2. Decode: This is where I think I have my problem. I was under the impression that PDEImageGetData(image, 0, data) will fill the data buffer with the color data in the image. I'm very new to PDFs, so I'm not sure what you mean by decode. Do you happen to have a code sample perhaps? 3. To DIB: I figure that I can use CreateDIBSection for this purpose. So, won't it work if I simply pass the data array obtained above to it? – Sach Mar 05 '12 at 01:34
  • @Sach: I do not know what decoding your PDF library already does, but let's assume it does not decode the streams. The two most used encodings for color/palette images are Flate (ZLib) and JPEG. So if you know how to use ZLib and have a JPEG decoder, your code might look like: open PDF file -> scan all objects in the file. If the object is a dictionary, has a stream, and has type = XObject and subtype = Image -> decode the stream if needed. Now you have the raw bytes, which you can feed to a DIB if and only if the colorspace used is supported by Windows. What if it is Lab or a Separation colorspace... – Ritsaert Hornstra Mar 06 '12 at 09:32
  • I didn't understand the bit where you say "decode stream if needed". What I do is, I extract PDEImage objects, then I want to convert them to DIBs so I can send those DIBs (in BYTE array form) to a separate library which does some color editing/correction stuff. I'm not familiar with ZLib and how to decode images. I have the luxury of knowing that the colorspace will only be either sRGB or AdobeRGB. – Sach Mar 07 '12 at 00:42
  • Sach: PDF supports a real plethora of colorspaces. Grayscale and RGB are two, but they can be device specific, calibrated, or ICC based. You also have CMYK (for printing), Lab, Separation, DeviceN and Indexed (palette based). Converting all these to sRGB takes several thousand lines of code and is quite tricky. Hopefully you won't need to support that many colorspaces. Also, PDF supports 1, 2, 4, 8 and 16 bit data per color component where Windows only supports 1, 4 and 8. Unfortunately the same goes for coding streams, where there are several encodings. Oh, and did I mention possible encryption? – Ritsaert Hornstra Mar 07 '12 at 09:06
  • @Sach: The problem here is: what you want seems very simple, and for most PDFs you just need DeviceRGB and JPEGDecode / FlateDecode. If it is JPEGDecode, the byte array found is just a JPEG file. For FlateDecode, just decode the bytes with ZLib (there should be libraries for about every programming language), yielding the raw bytes of the image. Perhaps you can read the ISO32000 standard for a better understanding. Focus on colorspaces, streams and document structure and you should have a good understanding of what you're up against. – Ritsaert Hornstra Mar 07 '12 at 09:10
  • Thanks Ritsaert, I think that was helpful. So in other words, if it is a JPEG, all I have to do is get the JPEG data into a buffer using PDEImageGetData(), right? I will read the specs anyway. Thanks! – Sach Mar 08 '12 at 04:55
  • If the Filter used is 'DCTDecode' or shorthand 'DCT', it means the image file is encoded as a baseline JPEG. So: extract the stream, use your favorite JPEG decoder and voila, the image. NB: the PDF 1.7 spec can be downloaded for free; ISO32000 is the paid ISO document but has exactly the same text. – Ritsaert Hornstra Mar 09 '12 at 20:22