Does anyone have sample code demonstrating how to extract vector graphics objects (such as those representing charts and flow diagrams) from a PDF using the xpdf library? There doesn't seem to be any documentation for the xpdf library available on the Web, nor could I find any sample code that uses the library to extract information from a PDF. I am going through xpdf's code base, but any pointers to its documentation or to sample code would be very helpful.
Viewed 776 times
1 Answer
The OutputDev class declares virtual member functions such as stroke, fill and clip. Implement those and extract the path and colour information from the GfxState passed to them. You'll find examples of path iteration in the OutputDev-derived classes in the xpdf code base, such as TextOutputDev or ImageOutputDev.
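To make the path iteration concrete, here is a minimal sketch, not taken from xpdf itself. The `Path` and `Subpath` structs are hypothetical stand-ins so the example compiles without the xpdf headers. As far as I recall, inside a `stroke(GfxState *state)` or `fill(GfxState *state)` override the real path comes from `state->getPath()`, and each subpath exposes its points plus a per-point curve flag; please check `GfxState.h` in your xpdf version before relying on that. The sketch assumes a Bezier segment is stored as three consecutive points, with the flag set on the first of them.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one subpath (NOT xpdf's GfxSubpath).
// Points are stored in parallel arrays; curve[i] is true when point i
// is a Bezier control point.
struct Subpath {
    std::vector<double> x, y;
    std::vector<bool> curve;
    bool closed = false;
};

// Hypothetical stand-in for a whole path (NOT xpdf's GfxPath).
struct Path {
    std::vector<Subpath> subpaths;
};

// Walk every subpath and emit an SVG-like description:
// M = move to, L = line to, C = cubic Bezier, Z = close subpath.
std::string pathToString(const Path& path) {
    std::ostringstream out;
    for (const Subpath& sp : path.subpaths) {
        const std::size_t n = sp.x.size();
        assert(sp.y.size() == n && sp.curve.size() == n);
        if (n == 0) continue;
        out << "M " << sp.x[0] << ' ' << sp.y[0] << ' ';
        std::size_t j = 1;
        while (j < n) {
            if (sp.curve[j]) {
                // Two control points followed by the segment's end point.
                assert(j + 2 < n);
                out << "C " << sp.x[j] << ' ' << sp.y[j] << ' '
                    << sp.x[j + 1] << ' ' << sp.y[j + 1] << ' '
                    << sp.x[j + 2] << ' ' << sp.y[j + 2] << ' ';
                j += 3;
            } else {
                out << "L " << sp.x[j] << ' ' << sp.y[j] << ' ';
                j += 1;
            }
        }
        if (sp.closed) out << "Z ";
    }
    return out.str();
}
```

In a real output device you would run this kind of loop from each of the path callbacks and record which one was invoked (stroke, fill, clip and so on) alongside the colour read from the state.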
edit: This outputdev may give you the example you need

user18428
-
Thanks for the quick response! I find that OutputDev also has several other virtual members, such as drawImage and drawChar. Which of these methods would be invoked for vector graphics objects? For text objects, would both drawChar and stroke be called (since text characters can also be drawn as a set of strokes)? Gfx.cc in the xpdf library seems to be invoking these methods on OutputDev, but I am not sure if I am on the right track. And is it possible to identify whether an object that Gfx is trying to render is text, a raster image or a vector object? – so1 Mar 22 '13 at 14:21
-
In addition, I would like to know how text present within a vector graphic, such as text within a flowchart component, would be rendered. Would such text be treated any differently from text appearing outside any vector graphics, such as in normal text regions? – so1 Mar 22 '13 at 14:28
-
It's a little bit more complex. Generally you can rely on drawChar/drawString for text (but there are exceptions for some rare PS objects; see TextOutputDev for reference). Characters are not rendered as strokes or fills anyway. For images, drawImage and its siblings (drawSoftMaskedImage ...) should do. – user18428 Mar 22 '13 at 14:33
-
You should consider accepting the answer if it was of any help – user18428 Mar 22 '13 at 17:01
-
Thanks, your replies were very helpful and timely! I have accepted the answer. One related question - would drawImage be used only for raster images and not for vector graphics? I will continue reading the code base tomorrow and should likely be able to find this out, but thought it would be helpful to know this ahead from someone who has prior experience. – so1 Mar 22 '13 at 20:30
-
drawImage and its siblings are only used for raster data (with or without masking). All vector operations are handled by OutputDev's fill, stroke, clip, eofill and eoclip. In those methods you'll only have path data and colors (fill or stroke) to work with, and no raster or text data. Good luck, and feel free to ask if you have questions about the xpdf code base – user18428 Mar 22 '13 at 23:45
-
Thanks again! I would like to understand whether there is an easy way to identify text present within vector graphics in PDF. Will post a new question for this, as the current question has been voted down :( – so1 Mar 31 '13 at 14:05
-
Unfortunately, my account seems to have been blocked from posting new questions because of the downvotes cast on the current question :( - strange, given that I have never got downvotes on any of my previous questions. So I am asking my next question here: is it possible to identify whether text extracted from a PDF using OutputDev's drawChar() is part of a vector graphic, for example the labels within flowchart components? Or would such text be represented as vector graphics (rather than as text) within the PDF too? – so1 Mar 31 '13 at 14:25
-
There is no such distinction (graphic text vs text). If you want to locate text rendered above graphic drawings, you will have to track the bounding boxes of vector paths and compare them with text positions. – user18428 Mar 31 '13 at 15:52
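The bounding-box comparison suggested above could look roughly like the following. This is only a sketch under stated assumptions: `Point`, `BBox`, `boundingBox` and `enclosingShape` are hypothetical names, not part of xpdf, and the path points and text positions are assumed to have already been collected in the same coordinate space (for example, both transformed to device space inside the callbacks). Picking the smallest enclosing box is a heuristic chosen for the example, so that a label is attributed to its flowchart node rather than to a page-sized frame.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types; none of these names come from xpdf.
struct Point { double x, y; };

struct BBox {
    double xMin, yMin, xMax, yMax;
    bool contains(const Point& p) const {
        return p.x >= xMin && p.x <= xMax && p.y >= yMin && p.y <= yMax;
    }
    double area() const { return (xMax - xMin) * (yMax - yMin); }
};

// Bounding box of all points of one path. Bezier control points are
// included, so the box may be slightly larger than the drawn curve.
BBox boundingBox(const std::vector<Point>& pts) {
    assert(!pts.empty());
    BBox b{pts[0].x, pts[0].y, pts[0].x, pts[0].y};
    for (const Point& p : pts) {
        b.xMin = std::min(b.xMin, p.x);
        b.yMin = std::min(b.yMin, p.y);
        b.xMax = std::max(b.xMax, p.x);
        b.yMax = std::max(b.yMax, p.y);
    }
    return b;
}

// Index of the smallest shape box containing the text position,
// or -1 if no box contains it.
int enclosingShape(const std::vector<BBox>& shapes, const Point& textPos) {
    int best = -1;
    double bestArea = 0.0;
    for (std::size_t i = 0; i < shapes.size(); ++i) {
        if (!shapes[i].contains(textPos)) continue;
        const double a = shapes[i].area();
        if (best < 0 || a < bestArea) {
            best = static_cast<int>(i);
            bestArea = a;
        }
    }
    return best;
}
```

You would fill `shapes` from the stroke/fill callbacks and call `enclosingShape` for each character position reported by drawChar. A result of -1 suggests ordinary body text, though a large background rectangle can still produce false positives.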