I am new to PDF Clown and need help parsing the contents of my PDF.

My PDF has a huge number of marked content sequences, which are visible when the content is displayed as a stream.

But I am not able to parse them into objects to extract the path information contained within, which is my objective.

Here is my code:

if (level.Contents[i] is MarkedContent)
{
    // Resolve the content data object to its underlying stream.
    PdfDataObject contentDataObj = level.Contents.BaseDataObject;
    PdfStream contentStream = (PdfStream)contentDataObj.Resolve();

    // Parse the decoded stream body into content objects.
    ContentParser contentParser = new ContentParser(contentStream.GetBody(true).ToByteArray());
    IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();

    // Here I get only two content objects, whereas the stream has many distinct marked contents.
    for (int k = 0; k < markerContentObjList.Count; k++)
    {
        // Not implemented yet: extract the path information from markerContentObjList[k].
    }
}

Below is the DOM Inspector screenshot and Stream data

[screenshot: PDF Clown DOM Inspector showing the stream data]

ss_mj
  • Can you share enough code to make the code runnable? And can you share your test PDF to reproduce the issue? – mkl Nov 25 '19 at 09:11
  • Hi @mkl, unfortunately I cannot share the original PDF, and my code is incomplete. As I said, I am new to PDF Clown and find it difficult to understand the hierarchy. I have, however, attached the PDF Clown DOM Inspector screenshot with sample PDF stream data. Can you please help me extract the path coordinates? Thanks in advance. – ss_mj Nov 25 '19 at 15:05
  • With only that screenshot I can merely guess, not test, but the screenshot already shows that both content streams in there are broken. Thus, I assume that your problem is due to these defects. I'll explain the defects in an answer, but that won't help you: PDF Clown requires valid inputs. – mkl Nov 25 '19 at 16:16
  • I would be grateful if you can hint why it is broken. Is it not a well formed PDF? The PDF reader is showing the contents correctly though. Please help me with your answer. Thanks. – ss_mj Nov 25 '19 at 16:24
  • *"The PDF reader is showing the contents correctly"* - PDF viewers tend to do many repairs of invalid content under the hood. To a certain degree this is OK, as the human viewer looking at the rendered output usually recognizes whether this output is OK or somehow garbled. Automatic PDF processors, on the other hand, should not be so lax. In particular, in use cases in which their output is further processed automatically without human intervention, there often is no plausibility check for the output, and anything incorrectly interpreted may result in completely broken databases or archives. – mkl Nov 25 '19 at 17:18

1 Answer

In Short

There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This most likely is causing the early stop of parsing. Even if it is not, PDF Clown would associate starts and ends of objects differently than intended. Thus, the only real fix of the issue is to ask the source of the documents to provide a non-broken version.

The First Content Stream

The screenshot you provided shows the first content stream of your page:

[image: first content stream]

The second content stream of that page exhibits the same issues as this one.

Non-Matching Starts and Ends of Marked Content Sequences

If we look at the marked content operators, we see

/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC

As you can see, there are two EMC operators for the first BDC. This is invalid. See ISO 32000-2, section 14.6, Marked content.
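To check your other files for the same defect, a minimal sketch like the following could count marked content starts and ends. It does not use PDF Clown at all; the class and method names are made up for illustration, and it assumes the decoded content stream can be split on whitespace, which ignores string operands, comments, and inline images that may contain operator-like text.

```csharp
using System;
using System.Collections.Generic;

public static class MarkedContentCheck
{
    // Returns the zero-based token indices of EMC operators that have no
    // open BMC/BDC to close; 'unclosed' receives the number of marked
    // content sequences still open at the end of the stream.
    // ASSUMPTION: naive whitespace tokenization of the decoded stream.
    public static List<int> FindUnmatchedEmc(string content, out int unclosed)
    {
        var unmatched = new List<int>();
        int depth = 0;
        string[] tokens = content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < tokens.Length; i++)
        {
            switch (tokens[i])
            {
                case "BMC":
                case "BDC":
                    depth++;
                    break;
                case "EMC":
                    if (depth == 0)
                        unmatched.Add(i);   // nothing left to close
                    else
                        depth--;
                    break;
            }
        }
        unclosed = depth;
        return unmatched;
    }
}
```

Calling `MarkedContentCheck.FindUnmatchedEmc(text, out int unclosed)` on the decoded text of one content stream gives the positions of the surplus EMC operators; an empty list together with `unclosed == 0` means the sequences are balanced within that stream.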

Invalid Fill Operator

Furthermore, there is a Fill operator directly following a text object:

BT
...
ET
f

This is also invalid: path painting operators are only allowed after a path object or a clipping path object, not after a text object. See ISO 32000-2, Figure 9, Graphics objects.
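A rough way to spot such stray painting operators is sketched below. The names are hypothetical, PDF Clown is not involved, and the same naive whitespace tokenization is assumed, so operator-like text inside string operands would cause false hits. The sketch treats a path as started only by `m` or `re` and as ended by any path painting operator.

```csharp
using System;
using System.Collections.Generic;

public static class PathPaintingCheck
{
    // Path painting operators, including 'n' (end path without painting).
    private static readonly HashSet<string> Painting = new HashSet<string>
    {
        "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"
    };

    // Returns the token indices of painting operators that occur while no
    // path is under construction.
    // ASSUMPTION: naive whitespace tokenization of the decoded stream.
    public static List<int> FindStrayPaintingOperators(string content)
    {
        var stray = new List<int>();
        bool pathOpen = false;
        string[] tokens = content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < tokens.Length; i++)
        {
            string t = tokens[i];
            if (t == "m" || t == "re")
            {
                pathOpen = true;            // a new path begins
            }
            else if (Painting.Contains(t))
            {
                if (!pathOpen) stray.Add(i);
                pathOpen = false;           // painting ends the current path
            }
        }
        return stray;
    }
}
```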

A Related PDF Clown Issue

Actually there is a bug in PDF Clown which makes processing of marked content with PDF Clown impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap, see this answer for details. This assumption is wrong and results in incorrect graphic state contents as explained in that answer.

Thus, one should patch marked content support out of PDF Clown as explained there to at least have proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.

Why PDF Clown Stops at the End of the First Stream

As you observed, PDF Clown stops not after the extra EMC but instead at the end of the first content stream.

This is due to the PDF Clown issue explained above: Based on the assumption that marked content sections and save/restore graphics state blocks are properly contained in each other, PDF Clown simply makes EMC and Q close the most recently opened and still open marked content section or save/restore graphics state block, without checking whether the closing operator matches the kind of block it closes.

Thus, it matches opening and closing operators in your stream like this:

[Start of page content]
.  q
.  .  /OC /Heading BDC
.  .  EMC
.  EMC
.  /OC /Drawing BDC
.  EMC
Q

So for PDF Clown that last Q does not match the initial q in the content but the start of page content itself.

I think that PDF Clown stops parsing here because it assumes it has found the end of page contents.
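The matching described above can be reproduced with a small model. This is only a simplified sketch of the behaviour as I described it, not PDF Clown's actual code, and all names in it are made up: every EMC or Q pops the most recently opened block from one shared stack, whatever its kind, and a closing operator that finds the stack empty is taken as closing the page content itself.

```csharp
using System;
using System.Collections.Generic;

public static class SharedStackModel
{
    // Returns the index of the first closing token that finds no open
    // block (and would thus "close" the page content), or -1 if none does.
    // Closings that hit a block of the other kind are added to 'mismatches'.
    public static int FirstCloseAtRoot(string[] tokens, List<string> mismatches)
    {
        var open = new Stack<string>();
        for (int i = 0; i < tokens.Length; i++)
        {
            string t = tokens[i];
            if (t == "q" || t == "BMC" || t == "BDC")
            {
                open.Push(t);
            }
            else if (t == "Q" || t == "EMC")
            {
                if (open.Count == 0) return i;
                string opener = open.Pop();
                // Q should close q; EMC should close BMC or BDC.
                bool sameKind = (t == "Q") == (opener == "q");
                if (!sameKind) mismatches.Add(t + " at token " + i + " closed " + opener);
            }
        }
        return -1;
    }
}
```

For the token sequence `q BDC EMC EMC BDC EMC Q` (operands left out), this model records that the second EMC closed the `q` and returns the index of the final `Q`, which matches the picture above.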

mkl
  • That was a wonderful explanation. Thanks @mkl. Is there any other way to extract them? Perhaps by writing a custom function to iterate and scan through the stream bytes line by line? I am just assuming here. I have multiple PDFs like this, and all seem to have the same format. – ss_mj Nov 25 '19 at 17:18
  • As long as you are 100% sure the problem operators can be identified, e.g. the extra **EMC** does not match a **BMC** or **BDC**, or the extra **EMC** follows another **EMC** while in your files that never happens by design, you can repair it. But if you don't have such clear criteria, you will probably remove the wrong **EMC** and work on incorrect marked content information later. Thus, you should diligently check all those PDF content streams to find criteria that are correct. – mkl Nov 25 '19 at 17:36