
I'm using PDF Clown to analyze a PDF document. In many documents, PDF Clown reports different heights for characters that obviously have the same height. Is there a workaround?

This is the code:

    while(_level.moveNext()) {
        ContentObject content = _level.getCurrent();
        if(content instanceof Text) {
            ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
            for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
                List<CharInfo> chars = new ArrayList<>();
                for(TextChar textChar : textString.getTextChars()) {
                    chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
                }
            }
        }
        else if(content instanceof XObject) {
            // Scan the external level
            if(((XObject)content).getScanner(_level)!=null){
                getContentLines(((XObject)content).getScanner(_level));
            }
        }
        else if(content instanceof ContainerObject){
            // Scan the inner level
            if(_level.getChildLevel()!=null){
                getContentLines(_level.getChildLevel());
            }
        }
    } 

Here is an example PDF document:

Example

In this document I marked two text chunks which both contain the word "million". When analyzing the size of each character in both occurrences of "million", I get the following:

  1. "m" in the first mark has height 14.50 and width 8.5
  2. "i" in the first mark has height 14.50 and width 3.0
  3. "l" in the first mark has height 14.50 and width 3.0
  4. "m" in the second mark has height 10.56 and width 6.255
  5. "i" in the second mark has height 10.56 and width 2.23
  6. "l" in the second mark has height 10.56 and width 2.23

Even though all characters of the two text chunks obviously have the same size, PDF Clown reports different sizes.

Jannik
  • Can you share a document and indicate glyphs in it which *obviously have the same height* but for which PDF Clown claims otherwise? And by height you mean the height of the `textChar.getBox()`? – mkl Jul 31 '17 at 12:35
  • **(A)** Please supply a PDF with which one can reproduce the issue. A screenshot hardly helps at all. And as you say that that issue occurs in many documents, it should be easy to provide a PDF without sensitive data. **(B)** Even if a screenshot would suffice, the first rectangle in yours seems to be cut off; probably there are some bigger letters before that "million,"... – mkl Jul 31 '17 at 14:35
  • A) Okay I will search for a PDF without sensitive data and B) There are no bigger letters before the first "million". It only looks like that. – Jannik Jul 31 '17 at 14:49
  • @mkl: Sorry for being late but I have finally uploaded an example – Jannik Aug 09 '17 at 12:02
  • I'll look at it later. A funny file sharing service, I got both your PDF and an XML file stating "All access to this object has been disabled"... – mkl Aug 09 '17 at 14:26
  • @mkl : Haha :D Sorry I have chosen the first service I had seen! – Jannik Aug 11 '17 at 06:54
  • No problem. At least I got the pdf, there are some file services with so many ads you don't find the download button... ;) I'm still looking into the problem. I could reproduce it but don't know the cause yet. – mkl Aug 11 '17 at 09:01
  • So... I've been looking into that whenever I had some time to spare. Indeed, internally PDF Clown falsely assumes that the latter "million" is drawn with a smaller size. While debugging the code for that I came across an architectural error in PDF Clown: it wrongly assumes that tagged content respects save/restore graphics state structures. This results in wrong assumptions concerning where state is restored. I'm not yet sure whether that is the cause of the false font size but it may be. I'm afraid, though, that one has to throw away the tagged content handling for proper parsing results. – mkl Aug 13 '17 at 08:41

1 Answer


The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as

begin-marked-content
save-graphics-state
restore-graphics-state
end-marked-content

or

save-graphics-state
begin-marked-content
end-marked-content
restore-graphics-state

but never as

save-graphics-state
begin-marked-content
restore-graphics-state
end-marked-content

or

begin-marked-content
save-graphics-state
end-marked-content
restore-graphics-state.

Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.

E.g. in the document at hand there are sequences like this:

q
[...1...]
/P <</MCID 0 >>BDC 
Q
[...2...]
EMC

Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.

Because of this wrong assumption, and given the order of /P <</MCID 0 >>BDC and Q, PDF Clown parses the above as a single save/restore graphics state block containing [...1...], an empty marked content block, and [...2...]: it takes Q as the end of the marked content block and EMC as the end of the graphics state block.

Thus, if [...2...] changes the graphics state, PDF Clown assumes those changes are limited to the lines above, while they actually remain in effect afterwards.


The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.

To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:

  1. In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent check:

      public List<ContentObject> parseContentObjects(
        )
      {
        final List<ContentObject> contentObjects = new ArrayList<ContentObject>();
        while(moveNext())
        {
          ContentObject contentObject = parseContentObject();
          // Multiple-operation graphics object end?
          if(contentObject instanceof EndText // Text.
            || contentObject instanceof RestoreGraphicsState // Local graphics state.
           /* || contentObject instanceof EndMarkedContent // End marked-content sequence. */
            || contentObject instanceof EndInlineImage) // Inline image.
            return contentObjects;
    
          contentObjects.add(contentObject);
        }
        return contentObjects;
      }
    
  2. In parseContentObject() I removed the else if(operation instanceof BeginMarkedContent) branch:

      public ContentObject parseContentObject(
        )
      {
        final Operation operation = parseOperation();
        if(operation instanceof PaintXObject) // External object.
          return new XObject((PaintXObject)operation);
        else if(operation instanceof PaintShading) // Shading.
          return new Shading((PaintShading)operation);
        else if(operation instanceof BeginSubpath
          || operation instanceof DrawRectangle) // Path.
          return parsePath(operation);
        else if(operation instanceof BeginText) // Text.
          return new Text(
            parseContentObjects()
            );
        else if(operation instanceof SaveGraphicsState) // Local graphics state.
          return new LocalGraphicsState(
            parseContentObjects()
            );
     /*   else if(operation instanceof BeginMarkedContent) // Marked-content sequence.
          return new MarkedContent(
            (BeginMarkedContent)operation,
            parseContentObjects()
            );
     */   else if(operation instanceof BeginInlineImage) // Inline image.
          return parseInlineImage();
        else // Single operation.
          return operation;
      }
    

With these changes in place, the character sizes are properly extracted.
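
To illustrate why dropping the marked content nesting helps, here is a toy model of a recursive parser. It is not PDF Clown code: the class `NestingDemo`, the flag `markedContentNests` and the placeholder operations `1` and `2` are made up for this sketch; only the operators q, Q, BDC and EMC come from the content stream above. With the flag set, any end operator closes the innermost open block, as in the unpatched parser; with the flag cleared, BDC and EMC are treated as plain operations.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NestingDemo {
    /**
     * Toy recursive parser (hypothetical, for illustration only).
     * q/Q always delimit a block, printed as GS{...}.
     * If markedContentNests is true, BDC/EMC also delimit a block, printed as MC{...},
     * and ANY end operator closes the innermost open block.
     */
    static String parse(Iterator<String> tokens, boolean markedContentNests) {
        StringBuilder out = new StringBuilder();
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (t.equals("q")) {
                out.append("GS{").append(parse(tokens, markedContentNests)).append("}");
            } else if (t.equals("BDC") && markedContentNests) {
                out.append("MC{").append(parse(tokens, markedContentNests)).append("}");
            } else if (t.equals("Q") || (t.equals("EMC") && markedContentNests)) {
                return out.toString(); // closes whatever block happens to be open
            } else {
                out.append("[").append(t).append("]"); // plain operation
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same operator order as in the document: q [1] BDC Q [2] EMC
        List<String> stream = Arrays.asList("q", "1", "BDC", "Q", "2", "EMC");
        System.out.println(parse(stream.iterator(), true));  // GS{[1]MC{}[2]}
        System.out.println(parse(stream.iterator(), false)); // GS{[1][BDC]}[2][EMC]
    }
}
```

In the first output, [2] ends up inside the graphics state block although Q has already been seen; in the second, the block ends at Q and [2] lies outside it, which is the structure the content stream actually describes.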


As an aside, while the individual character boxes returned seem to imply that each box is fully specific to its character, that is not true: only the width of the box is character-specific. The height is calculated from overall font properties (and the current font size), not from the individual character; cf. the getHeight(char) method of org.pdfclown.documents.contents.fonts.Font:

  /**
    Gets the unscaled height of the given character.

    @param textChar
      Character whose height has to be calculated.
  */
  public final double getHeight(
    char textChar
    )
  {
    /*
      TODO: Calculate actual text height through glyph bounding box.
    */
    if(textHeight == -1)
    {textHeight = getAscent() - getDescent();}
    return textHeight;
  }

Individual character height calculation is still a TODO.
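
A practical consequence is that two characters of the same font can only get different box heights if the parser assumes different font sizes for them. The following is a minimal sketch of that relationship, assuming the box height is simply the font-wide height scaled by the font size. The class `UniformHeightDemo`, the method `boxHeight` and the metric values are hypothetical and taken neither from PDF Clown nor from the example PDF.

```java
public class UniformHeightDemo {
    // Hypothetical font-wide metrics, expressed as fractions of the font size.
    static final double ASCENT = 0.9;
    static final double DESCENT = -0.2;

    /**
     * Height of a character box under the assumption stated above:
     * (ascent - descent) scaled by the font size. The character itself
     * is deliberately ignored, just as in getHeight(char).
     */
    static double boxHeight(char c, double fontSize) {
        return (ASCENT - DESCENT) * fontSize;
    }

    public static void main(String[] args) {
        // Same font size: every character gets the same height.
        for (char c : "mil".toCharArray()) {
            System.out.println(c + " at size 12: " + boxHeight(c, 12));
        }
        // Different font size: the height changes for all characters alike.
        System.out.println("m at size 8: " + boxHeight('m', 8));
    }
}
```

Read this way, the differing heights in the question point to a wrong current font size rather than to anything specific to the glyphs.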

mkl
  • Sorry I was busy the last weeks ! But now I tried your approach and it seems that it works ! Thanks a lot !! Amazing work !:) – Jannik Aug 30 '17 at 08:41
  • Testing your approach with the above example works. But now I'm getting a NullPointerException in my project when working with this new information. `:java.lang.NullPointerException at org.pdfclown.documents.contents.fonts.CompositeFont.loadEncoding(CompositeFont.java:189)` – Jannik Aug 30 '17 at 08:46
  • @Jannik The changes proposed in my answer do not directly interfere with font loading (indirectly they do: Without them the *save/restore graphics state block* recognition was wrong, so a wrong font could be considered to be the current one; with them those blocks are recognized correctly, so the *current font* might now be different, more exactly: the correct one). Thus, your observation quite likely has uncovered either an issue in the PDF or another problem in PDF Clown. Please make that a question in its own right with enough information to reproduce the issue. – mkl Aug 30 '17 at 09:12
  • Thank you ! I will make a question for this issue within the next days ! Amazing work!:) – Jannik Aug 30 '17 at 09:18