I want to identify the ColorSpace objects in a PDF and fetch their location (coordinates, width and height of the colorspace) in the page. I tried traversing the BaseDataObject in Contents.ContentContext.Resources.ColorSpaces. I can identify the Pantone color spaces in the file (as shown in the screenshot), but I am unable to find information about the location (x, y, w and h) of the object.

Where can I find the exact location of the visible objects (visible on opening a document) like ColorSpaces and embedded images?

I am using the 'pdfclown' library to extract the information about ColorSpaces from the PDF. Any input would be helpful. Thanks in advance.

ContentScanner cs = new ContentScanner(page);
System.Collections.Generic.List<org.pdfclown.documents.contents.colorSpaces.ColorSpace> list = cs.Contents.ContentContext.Resources.ColorSpaces.Values.ToList();
for (int i = 0; i < list.Count; i++)
{
    org.pdfclown.objects.PdfArray array = (org.pdfclown.objects.PdfArray)list[i].BaseDataObject;
    foreach (org.pdfclown.objects.PdfObject s in array)
    {
        // print colorspace and its x, y, w, h
    }
}

PDF Document (has CMYK and Pantone Colors)

Screenshot

Screenshot

mkl
ksa
  • **A** What is the `cs` you retrieve the `ColorSpace` list from? **B** You say *"visible objects like ColorSpaces"*, but the color spaces defined in a PDF are not visible objects by themselves. Thus, please explain what you mean exactly. **C** You ask for the location of *"ColorSpaces, embedded images and attachments e.t.c"*, but attachments are something completely different than the former two, so your "e.t.c" might mean anything. Thus, please explain what you mean exactly. **D** Your PDF document link is dead. – mkl May 22 '19 at 10:42
  • Ok, the link works now. Now please clarify as asked above. Furthermore, **E** you say you want to *"identify the ColorSpace objects in PDF and highlight them"*; identifying color spaces is no problem, but what do you mean by *highlighting* them? – mkl May 23 '19 at 11:24
  • Hi @mkl, **A** `cs` is the `ContentScanner` object (`ContentScanner cs = new ContentScanner(page);`). **B** I mean the colorspaces and images which are visible to the viewer in that example file. **C** I mean colorspaces and embedded images (updated in the question). **D** Updated the links. **E** if the user able to see the **colorspace** on opening the document, then I want to fetch the x,y,w,h of those **colorspaces**. Updated the question. – ksa May 23 '19 at 11:58
  • *"if the user able to see the colorspace on opening the document"* - a PDF **ColorSpace** object is *not visible*, so a user can *never see* it. What you probably mean is something like [this](https://i.stack.imgur.com/Dd6Ta.png) but that is merely a collection of rectangles filled with different colors. In particular there is nothing indicating that these rectangles somehow belong together (other than that their coordinates are arranged so that the rectangles are drawn next to each other). – mkl May 23 '19 at 12:56
  • If you don't happen to mean something like [this](https://i.stack.imgur.com/Dd6Ta.png) by "colorspace" then please make clear what you actually mean. – mkl May 23 '19 at 12:58
  • Hi @mkl, I assumed that the coloured rectangle is the visualisation of the **ColorSpace** object and its value. I learned that this is wrong after reading your answer. Thanks for correcting me. – ksa May 23 '19 at 17:31

1 Answer

> I want to identify the ColorSpace objects in PDF and fetch their location (coordinates, width and height of the colorspace) in the page.

I assume you mean the squares here:

Pantone solid

Beware: these are not PDF ColorSpace objects. They are a number of simple (rectangular) paths filled with distinct colors, with some text drawn upon them.

PDF ColorSpaces are not specific renderings of colored areas, they are abstract color specifications:

> Colours may be described in any of a variety of colour systems, or colour spaces. Some colour spaces are related to device colour representation (grayscale, RGB, CMYK), others to human visual perception (CIE-based). Certain special features are also modelled as colour spaces: patterns, colour mapping, separations, and high-fidelity and multitone colour.

(ISO 32000-1, section 8.6 "Colour Spaces")

Since you are looking for something with coordinates, width and height, what you actually want are the drawing instructions that use those abstract color spaces, not the plain color spaces themselves.

> I tried traversing through the BaseDataObject in Contents.ContentContext.Resources.ColorSpaces, I can identify the Pantone Colorspaces in file (as shown in screenshot), but unable to find info regarding the location (x, y, w and h) of the object.

By looking at cs.Contents.ContentContext.Resources.ColorSpaces you get an enumeration of all special color spaces available for use in the current context but not the actual usages. To get the actual usages, you have to traverse the ContentScanner cs, i.e. you have to inspect the instructions in the current context, e.g. like this:

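using System;
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.colorSpaces;
using org.pdfclown.documents.contents.objects;
using org.pdfclown.objects;

// Separation color space currently selected for filling (null if none).
SeparationColorSpace space = null;
// Most recently defined rectangle; all zero means "no rectangle pending".
double X = 0, Y = 0, Width = 0, Height = 0;

void ScanForSpecialColorspaceUsage(ContentScanner cs)
{
    cs.MoveFirst();
    while (cs.MoveNext())
    {
        ContentObject content = cs.Current;
        if (content is CompositeObject)
        {
            // Descend into nested content (e.g. text or marked-content blocks).
            ScanForSpecialColorspaceUsage(cs.ChildLevel);
        }
        else if (content is SetFillColorSpace _cs)
        {
            // A named color space is selected for filling; remember it only if it is a Separation space.
            ColorSpace _space = cs.Contents.ContentContext.Resources.ColorSpaces[_cs.Name];
            space = _space as SeparationColorSpace;
        }
        else if (content is SetDeviceCMYKFillColor || content is SetDeviceGrayFillColor || content is SetDeviceRGBFillColor)
        {
            // A device color replaces whatever special color space was selected before.
            space = null;
        }
        else if (content is DrawRectangle _dr)
        {
            // Remember the rectangle only while a Separation fill color space is active.
            if (space != null)
            {
                X = _dr.X;
                Y = _dr.Y;
                Width = _dr.Width;
                Height = _dr.Height;
            }
        }
        else if (content is PaintPath _pp)
        {
            // The path is painted: report it if it is a filled rectangle using a Separation space.
            if (space != null && _pp.Filled && (X != 0 || Y != 0 || Width != 0 || Height != 0))
            {
                // Element 1 of a Separation color space array is the colorant name.
                String name = ((PdfName)((PdfArray)space.BaseDataObject)[1]).ToString();
                Console.WriteLine("Filling rectangle at {0}, {1} with size {2}x{3} using {4}", X, Y, Width, Height, name);
            }
            // Painting ends the current path, so forget the pending rectangle.
            X = 0;
            Y = 0;
            Width = 0;
            Height = 0;
        }
    }
}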

BEWARE: This is merely a proof-of-concept, simplified as much as possible while still working in your PDF for the squares in the screenshot above.

For a general solution you will have to extend this considerably:

  • The code only inspects the given content scanner, i.e. only the content stream it has been initialized for, in your case a page content stream.

    From such a content stream other content streams may be referenced, e.g. a form XObject. To catch all the usages of interesting color spaces in a generic document, you have to recursively inspect such dependent content streams, too.

  • The code ignores the current transformation matrix.

    An instruction can change the current transformation matrix, so that the coordinates used by all following drawing instructions are mapped through an affine transformation. To get all coordinates and dimensions right in a generic document, you have to apply the current transformation matrix to them.

  • The code ignores save-graphics-state/restore-graphics-state instructions.

    The current graphics state (including fill color and current transformation matrix) can be stored on a stack and restored from it. To get colors, coordinates and dimensions right in a generic document, you have to keep track of saved and restored graphics states (or use data from the cs.State for color and transformation where PDF Clown does this for you).

  • The code only looks at Separation color spaces.

    If you're interested in other color spaces, too, you have to generalize this.

  • The code only understands very specific, trivial paths: only paths generated by a single instruction defining a rectangle.

    For a generic solution you have to support arbitrary paths.
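
To illustrate the points about the current transformation matrix and the save/restore instructions, here is a minimal sketch of the bookkeeping involved. It is deliberately independent of PDF Clown: `Matrix`, `GraphicsStateTracker` and `PageBounds` are hypothetical names made up for this example, not library API, and a real implementation would also keep the fill color space in the saved state. The matrix layout and the multiplication order follow the `cm` operator as described in ISO 32000-1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper type, not part of PDF Clown.
// Affine matrix [a b c d e f] in the layout used by the PDF "cm" operator:
//   x' = a*x + c*y + e,   y' = b*x + d*y + f
public struct Matrix
{
    public readonly double A, B, C, D, E, F;

    public Matrix(double a, double b, double c, double d, double e, double f)
    { A = a; B = b; C = c; D = d; E = e; F = f; }

    public static readonly Matrix Identity = new Matrix(1, 0, 0, 1, 0, 0);

    // Returns this x other: a point is mapped by 'this' first, then by 'other'.
    public Matrix Multiply(Matrix o)
    {
        return new Matrix(
            A * o.A + B * o.C,
            A * o.B + B * o.D,
            C * o.A + D * o.C,
            C * o.B + D * o.D,
            E * o.A + F * o.C + o.E,
            E * o.B + F * o.D + o.F);
    }

    public (double X, double Y) Apply(double x, double y)
    {
        return (A * x + C * y + E, B * x + D * y + F);
    }
}

// Hypothetical tracker: call Save/Restore/Concat when the scan meets q, Q and cm.
public sealed class GraphicsStateTracker
{
    private readonly Stack<Matrix> saved = new Stack<Matrix>();

    public Matrix Ctm { get; private set; } = Matrix.Identity;

    public void Save() { saved.Push(Ctm); }                            // "q"
    public void Restore() { if (saved.Count > 0) Ctm = saved.Pop(); }  // "Q"
    public void Concat(Matrix m) { Ctm = m.Multiply(Ctm); }            // "cm"

    // Axis-aligned bounding box, in untransformed page coordinates,
    // of a rectangle given in the current (transformed) coordinates.
    public (double X, double Y, double Width, double Height) PageBounds(
        double x, double y, double w, double h)
    {
        var corners = new[]
        {
            Ctm.Apply(x, y), Ctm.Apply(x + w, y),
            Ctm.Apply(x, y + h), Ctm.Apply(x + w, y + h)
        };
        double minX = double.PositiveInfinity, minY = double.PositiveInfinity;
        double maxX = double.NegativeInfinity, maxY = double.NegativeInfinity;
        foreach (var p in corners)
        {
            minX = Math.Min(minX, p.X); maxX = Math.Max(maxX, p.X);
            minY = Math.Min(minY, p.Y); maxY = Math.Max(maxY, p.Y);
        }
        return (minX, minY, maxX - minX, maxY - minY);
    }
}
```

For example, after `Save()` and `Concat(new Matrix(2, 0, 0, 2, 10, 20))`, the call `PageBounds(1, 1, 3, 4)` yields the rectangle at 12, 22 with size 6x8; after `Restore()` the same call yields 1, 1 with size 3x4 again. Taking the bounding box of all four corners keeps the result meaningful when the matrix rotates or skews, although the painted shape is then no longer an axis-aligned rectangle.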

Community
mkl
  • Yes @mkl, I was referring to those coloured rectangles, thinking of them as ColorSpaces, but what I need is the location of those coloured rectangles and the color used to fill them in this file. PDFs may contain more graphic content (vector graphics generated by CorelDraw-like editors). In this [file](https://nofile.io/f/3MQY1zYmSrR/b_01.pdf), how can we find the region occupied by the background (the triangular background behind the text and the man, assuming it is a graphic)? Is this related to the paths you are talking about? Anyway, you have been very helpful; many thanks for the help and input. – ksa May 23 '19 at 18:16
  • Currently `nofile.io` appears not to be reachable. I'll check again later. – mkl May 27 '19 at 16:35
  • @ksa I checked again, the link does not resolve anymore. – mkl Jun 08 '19 at 19:28