Is it possible to get the height of the page content using PDFBox? I think I tried everything, but each (PDRectangle) returns full height of the page: 842. At first I thought this was because of the page number placed at the bottom of the page, but when I opened the PDF in Illustrator, the whole content was inside a compound element that did not extend to the full page height. So if Illustrator can see it as a separate element and calculate its height, I guess this should also be possible in PDFBox.

Sample page:

[image: sample page]

El Kopyto
  • If the document had been created with Illustrator... Illustrator leaves its own, proprietary information in the document from which it may show some compound element. If you share the PDF in question, we may tell whether there is any corresponding PDF structure or whether that is a mere Illustrator-ism. – mkl Feb 04 '15 at 14:30
  • The PDF is generated by an app and has nothing to do with Illustrator, which was used just to inspect the PDF. – El Kopyto Feb 04 '15 at 14:49
  • Then it might be an XObject or a clip path or anything like that. If you can share the PDF... – mkl Feb 04 '15 at 15:07
  • Here is the PDF: http://d.pr/f/137PF The text in the boxes can have multiple lines, so the size isn't constant. Is it possible to get the position and the size of this header (white + gray box)? – El Kopyto Feb 04 '15 at 15:58
  • At first glance I can only see a clip path there. Content stream analysis is required to find those. I'll look into it some more tomorrow, back in office. – mkl Feb 04 '15 at 16:42
  • The main PDF was changed (generated by phantomjs now): http://d.pr/f/15uBF but the problem remains: I'm not able to identify the last shape and its position in the header. – El Kopyto Feb 05 '15 at 10:12

1 Answer


In general

The PDF specification allows a PDF to provide a number of page boundaries, cf. this answer. Aside from those, content boundaries can only be derived from the page contents, e.g. from

  • Form XObjects:

    A form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects (including path objects, text objects, and sampled images). A form XObject may be painted multiple times—either on several pages or at several locations on the same page—and produces the same results each time, subject only to the graphics state at the time it is invoked.

  • Clipping Paths:

    The graphics state shall contain a current clipping path that limits the regions of the page affected by painting operators. The closed subpaths of this path shall define the area that can be painted. Marks falling inside this area shall be applied to the page; those falling outside it shall not be.

  • ...

To find either of them, one has to parse the page content, look for the appropriate operations, and calculate the resulting boundaries.
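
As a rough illustration of what "look for the appropriate operations" means, here is a minimal sketch that scans a content stream fragment for rectangles used as clipping paths (an `re` operator followed by `W` or `W*`). It is deliberately simplified: it only splits on whitespace, ignores the current transformation matrix, strings and inline images, and only recognizes paths that consist of a single rectangle. The class name `ClipRectangleScan` and the sample fragment are made up for this example; they are not taken from the PDFs discussed here.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ClipRectangleScan {
    private static final List<String> PATH_BUILD_OPS =
            Arrays.asList("m", "l", "c", "v", "y", "h", "re");
    private static final List<String> PATH_END_OPS =
            Arrays.asList("n", "S", "s", "f", "F", "f*", "B", "B*", "b", "b*");

    /** Returns the rectangles that are the sole content of a path marked with W or W*. */
    static List<Rectangle2D.Float> findClipRectangles(String content) {
        List<Rectangle2D.Float> clips = new ArrayList<>();
        List<Float> operands = new ArrayList<>();
        Rectangle2D.Float pending = null; // rectangle appended by the latest "re", if any
        int pathOps = 0;                  // path construction operators in the current path

        for (String token : content.trim().split("\\s+")) {
            Float number = parseNumber(token);
            if (number != null) {         // numbers are operands of the next operator
                operands.add(number);
                continue;
            }
            if (PATH_BUILD_OPS.contains(token)) {
                pathOps++;
                pending = null;
                if (token.equals("re") && operands.size() >= 4) {
                    int n = operands.size();
                    float x = operands.get(n - 4), y = operands.get(n - 3);
                    float w = operands.get(n - 2), h = operands.get(n - 1);
                    // width and height may be negative, so normalize the rectangle
                    pending = new Rectangle2D.Float(Math.min(x, x + w), Math.min(y, y + h),
                                                    Math.abs(w), Math.abs(h));
                }
            } else if (token.equals("W") || token.equals("W*")) {
                if (pathOps == 1 && pending != null) {
                    clips.add(pending);   // single-rectangle path used for clipping
                }
            } else if (PATH_END_OPS.contains(token)) {
                pathOps = 0;              // painting or "n" ends the current path
                pending = null;
            }
            operands.clear();             // every operator consumes its operands
        }
        return clips;
    }

    private static Float parseNumber(String token) {
        try {
            return Float.parseFloat(token);
        } catch (NumberFormatException e) {
            return null;                  // not a number, so it is treated as an operator
        }
    }

    public static void main(String[] args) {
        // Made-up fragment for demonstration only
        String fragment = "q 28.32 660.22 537.6 153.46 re W n BT /F1 12 Tf ET Q";
        for (Rectangle2D.Float clip : findClipRectangles(fragment)) {
            System.out.println(clip);
        }
    }
}
```

A real implementation needs a proper content stream parser and has to track the graphics state, which is what the PDFBox-based approach further down in this answer relies on.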

In the OP's case

Each of your sample PDFs defines explicitly only one page boundary, the MediaBox. Thus, all of the other PDF page boundaries (CropBox, BleedBox, TrimBox, ArtBox) default to it. So it is no wonder that in your attempts

each (PDRectangle) returns full height of the page: 842
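
The defaulting behaviour can be sketched without any PDF library. According to the PDF specification, a missing CropBox falls back to the MediaBox, and a missing BleedBox, TrimBox or ArtBox falls back to the CropBox. The class `PageBoxDefaults`, the map-based page model, and the page width of 595 are assumptions for illustration only; only the height of 842 comes from the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a page dictionary, not a PDFBox class.
class PageBoxDefaults {
    /** Returns the explicitly defined box, or the box it defaults to. */
    static float[] effectiveBox(Map<String, float[]> page, String name) {
        float[] explicit = page.get(name);
        if (explicit != null) {
            return explicit;
        }
        switch (name) {
            case "MediaBox":
                throw new IllegalStateException("MediaBox is required");
            case "CropBox":
                return effectiveBox(page, "MediaBox");
            default: // BleedBox, TrimBox, ArtBox
                return effectiveBox(page, "CropBox");
        }
    }

    public static void main(String[] args) {
        Map<String, float[]> page = new HashMap<>();
        // Only the MediaBox is defined; the width 595 is an assumed value
        page.put("MediaBox", new float[] {0, 0, 595, 842});
        float[] artBox = effectiveBox(page, "ArtBox");
        System.out.println("ArtBox height: " + (artBox[3] - artBox[1]));
    }
}
```

With only a MediaBox present, every lookup ends at the same rectangle, which is consistent with every box reporting the full page height.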

Neither of them contains form XObjects, but both make use of clipping paths.

  • In case of test-pdf4.pdf:

    Start at: 28.31999969482422, 813.6799926757812
    Line to: 565.9199829101562, 813.6799926757812
    Line to: 565.9199829101562, 660.2196655273438
    Line to: 28.31999969482422, 660.2196655273438
    Line to: 28.31999969482422, 813.6799926757812
    

    (This might match the sketch in your question.)

  • In case of test-pdf5.pdf:

    Start at: 23.0, 34.0
    Line to: 572.0, 34.0
    Line to: 572.0, -751.0
    Line to: 23.0, -751.0
    Line to: 23.0, 34.0
    

    and

    Start at: 23.0, 819.0
    Line to: 572.0, 819.0
    Line to: 572.0, 34.0
    Line to: 23.0, 34.0
    Line to: 23.0, 819.0
    

Given the match with the sketch, I would assume that Illustrator treats everything drawn while a non-trivial clipping path is in effect as a compound element with the clipping path as its border.
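
As a rough illustration of how the vertices listed above translate into a size, the sketch below feeds the test-pdf4.pdf coordinates (rounded to two decimals) into `java.awt.geom.Path2D` and reads off the bounding box. The class name `ClipBounds` and the helper `boundsOf` are made up for this example, and it assumes that the coordinates are already in the coordinate system you want to measure in, which is not verified here.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class ClipBounds {
    /** Builds a closed polygon from (x, y) pairs and returns its bounding box. */
    static Rectangle2D boundsOf(float[][] points) {
        Path2D.Float outline = new Path2D.Float();
        outline.moveTo(points[0][0], points[0][1]);
        for (int i = 1; i < points.length; i++) {
            outline.lineTo(points[i][0], points[i][1]);
        }
        outline.closePath();
        return outline.getBounds2D();
    }

    public static void main(String[] args) {
        // Vertices of the clip path reported for test-pdf4.pdf, rounded
        float[][] testPdf4 = {
            {28.32f, 813.68f}, {565.92f, 813.68f},
            {565.92f, 660.22f}, {28.32f, 660.22f}
        };
        Rectangle2D box = boundsOf(testPdf4);
        System.out.println("x=" + box.getMinX() + " y=" + box.getMinY()
                + " width=" + box.getWidth() + " height=" + box.getHeight());
    }
}
```

For these vertices the height comes out at roughly 153.46 units (813.68 minus 660.22), which would be the height of the clipped area.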

Finding clipping paths with PDFBox

I used PDFBox to find the clipping paths mentioned above. I used the current SNAPSHOT of the version 2.0.0 now under development as the required APIs have been much improved compared to the current release version 1.8.8.

I extended PDFGraphicsStreamEngine to a ClipPathFinder class:

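// PDFBox package names as in the 2.0.x releases; they may differ in older snapshots.
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class ClipPathFinder extends PDFGraphicsStreamEngine implements Iterable<Path>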
{
    public ClipPathFinder(PDPage page)
    {
        super(page);
    }

    // Entry point: runs the engine over the page and collects its clip paths.
    public void findClipPaths() throws IOException
    {
        processPage(getPage());
    }

    //
    // PDFGraphicsStreamEngine overrides
    //

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        startPathIfNecessary();
        currentPath.appendRectangle(toFloat(p0), toFloat(p1), toFloat(p2), toFloat(p3));
    }

    @Override
    public void drawImage(PDImage pdImage) throws IOException { }

    @Override
    public void clip(int windingRule) throws IOException
    {
        currentPath.complete(windingRule);
        paths.add(currentPath);
        currentPath = null;
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        startPathIfNecessary();
        currentPath.moveTo(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        currentPath.lineTo(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        currentPath.curveTo(x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D.Float getCurrentPoint() throws IOException
    {
        return currentPath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        currentPath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        currentPath = null;
    }

    @Override
    public void strokePath() throws IOException
    {
        currentPath = null;
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        currentPath = null;
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        currentPath = null;
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException
    {
        currentPath = null;
    }

    void startPathIfNecessary()
    {
        if (currentPath == null)
            currentPath = new Path();
    }

    Point2D.Float toFloat(Point2D p)
    {
        if (p == null || (p instanceof Point2D.Float))
        {
            return (Point2D.Float)p;
        }
        return new Point2D.Float((float)p.getX(), (float)p.getY());
    }

    //
    // Iterable<Path> implementation
    //
    public Iterator<Path> iterator()
    {
        return paths.iterator();
    }

    Path currentPath = null;
    final List<Path> paths = new ArrayList<Path>();
}

It uses this helper class to represent paths:

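import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Path implements Iterable<Path.SubPath>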
{
    public static class Segment
    {
        Segment(Point2D.Float start, Point2D.Float end)
        {
            this.start = start;
            this.end = end;
        }

        public Point2D.Float getStart()
        {
            return start;
        }

        public Point2D.Float getEnd()
        {
            return end;
        }

        final Point2D.Float start, end; 
    }

    public class SubPath implements Iterable<Segment>
    {
        public class Line extends Segment
        {
            Line(Point2D.Float start, Point2D.Float end)
            {
                super(start, end);
            }

            //
            // Object override
            //
            @Override
            public String toString()
            {
                StringBuilder builder = new StringBuilder();
                builder.append("    Line to: ")
                       .append(end.getX())
                       .append(", ")
                       .append(end.getY())
                       .append('\n');
                return builder.toString();
            }
        }

        public class Curve extends Segment
        {
            Curve(Point2D.Float start, Point2D.Float control1, Point2D.Float control2, Point2D.Float end)
            {
                super(start, end);
                this.control1 = control1;
                this.control2 = control2;
            }

            public Point2D getControl1()
            {
                return control1;
            }

            public Point2D getControl2()
            {
                return control2;
            }

            //
            // Object override
            //
            @Override
            public String toString()
            {
                StringBuilder builder = new StringBuilder();
                builder.append("    Curve to: ")
                       .append(end.getX())
                       .append(", ")
                       .append(end.getY())
                       .append(" with Control1: ")
                       .append(control1.getX())
                       .append(", ")
                       .append(control1.getY())
                       .append(" and Control2: ")
                       .append(control2.getX())
                       .append(", ")
                       .append(control2.getY())
                       .append('\n');
                return builder.toString();
            }

            final Point2D control1, control2; 
        }

        SubPath(Point2D.Float start)
        {
            this.start = start;
            currentPoint = start;
        }

        public Point2D getStart()
        {
            return start;
        }

        void lineTo(float x, float y)
        {
            Point2D.Float end = new Point2D.Float(x, y);
            segments.add(new Line(currentPoint, end));
            currentPoint = end;
        }

        void curveTo(float x1, float y1, float x2, float y2, float x3, float y3)
        {
            Point2D.Float control1 = new Point2D.Float(x1, y1);
            Point2D.Float control2 = new Point2D.Float(x2, y2);
            Point2D.Float end = new Point2D.Float(x3, y3);
            segments.add(new Curve(currentPoint, control1, control2, end));
            currentPoint = end;
        }

        void closePath()
        {
            closed = true;
            currentPoint = start;
        }

        //
        // Iterable<Segment> implementation
        //
        public Iterator<Segment> iterator()
        {
            return segments.iterator();
        }

        //
        // Object override
        //
        @Override
        public String toString()
        {
            StringBuilder builder = new StringBuilder();
            builder.append("  {\n    Start at: ")
                   .append(start.getX())
                   .append(", ")
                   .append(start.getY())
                   .append('\n');
            for (Segment segment : segments)
                builder.append(segment);
            if (closed)
                builder.append("    Closed\n");
            builder.append("  }\n");
            return builder.toString();
        }

        boolean closed = false;
        final Point2D.Float start;
        final List<Segment> segments = new ArrayList<Path.Segment>();
    }

    public class Rectangle extends SubPath
    {
        Rectangle(Point2D.Float p0, Point2D.Float p1, Point2D.Float p2, Point2D.Float p3)
        {
            super(p0);
            lineTo((float)p1.getX(), (float)p1.getY());
            lineTo((float)p2.getX(), (float)p2.getY());
            lineTo((float)p3.getX(), (float)p3.getY());
            closePath();
        }

        //
        // Object override
        //
        @Override
        public String toString()
        {
            StringBuilder builder = new StringBuilder();
            builder.append("  {\n    Rectangle\n    Start at: ")
                   .append(start.getX())
                   .append(", ")
                   .append(start.getY())
                   .append('\n');
            for (Segment segment : segments)
                builder.append(segment);
            if (closed)
                builder.append("    Closed\n");
            builder.append("  }\n");
            return builder.toString();
        }
    }

    public int getWindingRule()
    {
        return windingRule;
    }

    void complete(int windingRule)
    {
        finishSubPath();
        this.windingRule = windingRule;
    }

    void appendRectangle(Point2D.Float p0, Point2D.Float p1, Point2D.Float p2, Point2D.Float p3) throws IOException
    {
        finishSubPath();
        currentSubPath = new Rectangle(p0, p1, p2, p3);
        finishSubPath();
    }

    void moveTo(float x, float y) throws IOException
    {
        finishSubPath();
        currentSubPath = new SubPath(new Point2D.Float(x, y));
    }

    void lineTo(float x, float y) throws IOException
    {
        currentSubPath.lineTo(x, y);
    }

    void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        currentSubPath.curveTo(x1, y1, x2, y2, x3, y3);
    }

    Point2D.Float getCurrentPoint() throws IOException
    {
        return currentPoint;
    }

    void closePath() throws IOException
    {
        currentSubPath.closePath();
        finishSubPath();
    }

    void finishSubPath()
    {
        if (currentSubPath != null)
        {
            subPaths.add(currentSubPath);
            currentSubPath = null;
        }
    }

    //
    // Iterable<Path.SubPath> implementation
    //
    public Iterator<SubPath> iterator()
    {
        return subPaths.iterator();
    }

    //
    // Object override
    //
    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder();
        builder.append("{\n  Winding: ")
               .append(windingRule)
               .append('\n');
        for (SubPath subPath : subPaths)
            builder.append(subPath);
        builder.append("}\n");
        return builder.toString();
    }

    Point2D.Float currentPoint = null;
    SubPath currentSubPath = null;
    int windingRule = -1;
    final List<SubPath> subPaths = new ArrayList<Path.SubPath>();
}

The class ClipPathFinder is used like this:

PDDocument document = PDDocument.load(PDFRESOURCE, null);
PDPage page = document.getPage(PAGENUMBER);
ClipPathFinder finder = new ClipPathFinder(page);
finder.findClipPaths();

for (Path path : finder)
{
    System.out.println(path);
}

document.close();
mkl
  • Is this doable in 1.8.8? – El Kopyto Feb 05 '15 at 15:30
  • The `PDFGraphicsStreamEngine` (from which the `ClipPathFinder` is derived) is a more generic offshoot of the PDFBox 1.8.8 `PageDrawer` base. Probably one can tweak that `PageDrawer` class to serve the same purpose as `PDFGraphicsStreamEngine`. Otherwise one has to do more copy&paste and create one's own graphics stream engine based on the 1.8.8 `PDFStreamEngine`. To cut a long story short: it is possible, but you have to find or create a replacement for the `PDFGraphicsStreamEngine` base class. – mkl Feb 05 '15 at 16:22
  • Looking at the code some more... I think I would try to backport `PDFGraphicsStreamEngine` and all its `GraphicsOperatorProcessor` classes if I needed the functionality in 1.8.8. – mkl Feb 05 '15 at 16:50
  • I wonder if it wouldn't be easier to set artbox (or other boundary) in phantomjs, which could be then retrieved... – El Kopyto Feb 05 '15 at 17:19
  • *grin*, yes, using these constructs indeed would be best for post-processing. – mkl Feb 05 '15 at 20:03
  • @mkl, wonderful answer. What I understood from the PDF specs is that a path that has a clipping operation before the final painting operation will be painted, unless that path ends with a `n` operator. Is that correct? – rivu Jun 20 '16 at 18:47
  • Also, it seemed your code can be used to see the lines and curves in a PDF page (just checking how many are there, not the bounding box or other info, which I guess can be added later) by just adding `paths.add(currentPath);` to the path painting methods and `segments.add(new Line(currentPoint,start));` to the SubPath.closePath method. I tried this with the Adobe PDF spec (http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf), page 6 and it worked. – rivu Jun 21 '16 at 04:34
  • However, I noticed something weird for page 5 in this PDF (https://drive.google.com/file/d/0B65bQnJhC1mvbEVQQ0o0QU9STlU/view?usp=sharing). Using the modified code mentioned above I got 956 lines and 48 curves. I also wrote a small program ( http://pastebin.com/ehtPXgY2 ) to count the number of lines and curves using PDFStreamParser, and that gave me 1084 lines, 53 curves and 25 rectangles (i.e., 1184 lines in total). So I am a bit confused about what is happening. – rivu Jun 21 '16 at 04:34
  • @rivu Please also share your changed `Path` and `ClipPathFinder` classes to reproduce the result. And please also make it a question in its own right, it is difficult to properly discuss code in comments alone. – mkl Jun 21 '16 at 07:22
  • @mkl, I have added a new question. Thanks so much for getting back. http://stackoverflow.com/questions/37950491/getting-line-and-curve-paths-using-pdfbox-different-results-extending-pdfstream – rivu Jun 21 '16 at 17:06
  • @mkl, sorry, I have a new question :| at http://stackoverflow.com/questions/38005345/confusion-about-current-transformation-matrix-in-a-pdf. Would you please have a look? Sorry to bother you so many times. – rivu Jun 24 '16 at 03:56