
I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a way to extract images and text (with coordinates) from a PDF.

The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in that image.

I know there is no way to extract structured information from a PDF like this, but with the coordinates of all image and text objects I could write code to link text to an image by its distance from that image. Then I could split the text using a regular expression and work out which part is a product code, which is an image code, and so on.
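To make the idea concrete, here is a minimal sketch of the distance-based linking described above. It assumes some extraction library has already produced bounding boxes for images and words. The `Box` type, the `max_distance` threshold, and both code patterns (`IMG-001`, `AB1234`) are hypothetical and would need to be adapted to the real catalog.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates, optionally carrying text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(ax - bx, ay - by)


# Hypothetical code formats; replace with the patterns used in the real catalog.
IMAGE_CODE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{4}$")


def link_text_to_images(images, words, max_distance=150.0):
    """Attach each word to its nearest image, then classify it by regex."""
    result = {i: {"image_code": None, "product_codes": []}
              for i in range(len(images))}
    if not images:
        return result
    for word in words:
        nearest = min(range(len(images)),
                      key=lambda i: distance(images[i], word))
        if distance(images[nearest], word) > max_distance:
            continue  # too far from every image to belong to one
        if IMAGE_CODE.match(word.text):
            result[nearest]["image_code"] = word.text
        elif PRODUCT_CODE.match(word.text):
            result[nearest]["product_codes"].append(word.text)
    return result


# Made-up coordinates: two images side by side, each with codes printed below.
images = [Box(50, 50, 150, 150), Box(300, 50, 400, 150)]
words = [
    Box(50, 155, 100, 165, "IMG-001"),
    Box(50, 170, 100, 180, "AB1234"),
    Box(300, 155, 350, 165, "IMG-002"),
    Box(300, 170, 350, 180, "CD5678"),
    Box(300, 185, 350, 195, "page"),
]
linked = link_text_to_images(images, words)
```

Center-to-center distance is the simplest choice. For a layout where captions always sit below or beside the image, an edge-to-edge distance or a directional rule may match better.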

Could you recommend a good, working solution for this task?

Bobrovsky
Alex
  • Are you targeting a certain platform/language? When you say "scan", do you mean "look through", or are you actually scanning a physical object and would therefore need OCR capabilities? – Chris Haas Nov 23 '11 at 14:27
  • Thanks for your reply! I program in .NET, so any library with a .NET port is good. I also know Java, so as a last resort I could use a Java library. I don't need OCR: my PDF contains text and images. The text is rendered in the PDF's content stream, so I need some kind of parser/renderer that tells me where a string would be rendered on a page. I just need the coordinates. – Alex Nov 30 '11 at 22:43

3 Answers


Use XPDF (http://www.foolabs.com/xpdf/)

It can extract all the characters in the PDF with co-ordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.

It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
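As a rough sketch of how the `-bbox` output could be consumed: as far as I know, this flag is provided by the Poppler build of pdftotext and writes an XHTML file with one `<word>` element per word carrying `xMin`, `yMin`, `xMax` and `yMax` attributes. Treat that shape as an assumption and check it against your own output. The `SAMPLE` string below is hand-written for illustration, not real tool output, and the function names are my own.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftotext_bbox(pdf_path, out_path):
    """Run `pdftotext -bbox` (assumes a build supporting -bbox is on PATH)."""
    subprocess.run(["pdftotext", "-bbox", pdf_path, out_path], check=True)
    with open(out_path, "rb") as handle:
        return handle.read()


def parse_bbox_xhtml(xhtml):
    """Return one dict per <word> element with its page number and box."""
    root = ET.fromstring(xhtml)
    words = []
    page_no = 0
    for elem in root.iter():  # document order, so a page precedes its words
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace prefix
        if tag == "page":
            page_no += 1
        elif tag == "word":
            words.append({
                "page": page_no,
                "text": elem.text or "",
                "x_min": float(elem.get("xMin")),
                "y_min": float(elem.get("yMin")),
                "x_max": float(elem.get("xMax")),
                "y_max": float(elem.get("yMax")),
            })
    return words


# Hand-written sample in the assumed output shape.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.0" height="842.0">
    <word xMin="56.8" yMin="57.2" xMax="98.4" yMax="70.5">IMG-001</word>
    <word xMin="102.0" yMin="57.2" xMax="140.1" yMax="70.5">AB1234</word>
  </page>
</doc>
</body>
</html>"""

words = parse_bbox_xhtml(SAMPLE)
for word in words:
    print(word["page"], word["text"], word["x_min"], word["y_min"])
```

For a real file you would call `parse_bbox_xhtml(run_pdftotext_bbox("catalog.pdf", "catalog.html"))`, where both file names are placeholders. Image positions are not part of this output, so they would have to come from another tool or library.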

Nerdmaster
  • I've been using pdftotext for years and never twigged it had this feature! Never been able to work out how to easily extract coordinates before. – fred2 Jul 06 '15 at 18:14
  • Is this correct? The bbox option doesn't seem to work for me, and I can't find anything about it in the documentation. – jss Jan 11 '16 at 21:39

Several Java libraries can do this. Have you looked at JPedal or PdfBox?

mark stephens
  • I just tried iTextSharp with the RenderListener. It sort of works, but not very well. For my PDF, iTextSharp returns the images with correct coordinates, but all the text layers have wrong coordinates. I also think my PDF has 2 text layers and iTextSharp doesn't give me the coordinates. I drew what iTextSharp returns onto an image box, and I can quickly see that there are 3 layers (1 for images and 2 for text) and that these layers are not aligned at all. – Alex Nov 30 '11 at 22:49
  • Could you share the code that you used to extract the image coordinates? renderImage is passed an ImageRenderInfo. How do I extract coordinates from that? – letronje Oct 06 '13 at 12:00

If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" of your interest, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for putting close text together. From the documentation:

IacDocument.GetObjectsInRectangle Method

The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.

Usual disclaimer applies.

yms