I am using iText's LocationTextExtractionStrategy class to find a special character in a PDF, and that part seems to be working. I get an iText vector such as 94.5,698.9,1.0 (from getStartLocation()) and 100.5,698.9,1.0 (from getEndLocation()). I want to express that location as a percentage of the page dimensions.

The objective is to export the PDF page as an image, show it in a div on a web page, and use another semi-translucent div as an overlay to highlight the area. For example, I can locate a person's name followed by the special character, and I want to put a div right over the spot where the special character is.

Since I have the iText vector, if I can convert it to a percentage relative to the PDF page, I can use that to position the div, for example top 25.125%, left 30.55%. It's OK if the location is slightly off, since I just want to highlight the general area (give or take about 5 pixels vertically or horizontally).

Michael M
  • The calculation for the relative distance from the left is `(location.getEndLocation().get(0) / pageWidth) * 100`. The calculation for the relative position from the top still eludes me. – Michael M Aug 05 '13 at 23:44
  • Got it. Since PDF coordinates start at the bottom left, one must subtract the y coordinate from the page height: `((pageHeight - location.getEndLocation().get(1)) / pageHeight) * 100` – Michael M Aug 06 '13 at 00:14

1 Answer

leftPercent = (location.getEndLocation().get(0) / pageWidth) * 100
topPercent = ((pageHeight - location.getEndLocation().get(1)) / pageHeight) * 100

If you have an 8.5 inch (wide) by 11 inch (tall) document, the page measures 612 by 792 points (72 points per inch). If the special character is at vector 152,594,1.0, the equations work out as follows:

leftPercent = (152 / 612) * 100 ≈ 24.8% (roughly 25%) and topPercent = ((792 - 594) / 792) * 100 = 25%

In my test case, I deliberately had the special character placed 25% from the top and 25% from the left.
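
The same arithmetic can be sketched as a small helper. This is only an illustration: the class `PagePercent` and its parameter names are hypothetical and not part of iText. It also takes the lower-left corner of the page box as input, on the assumption that the box used for the exported image may not start at 0, 0. In iText 5 those values could presumably be read from the `Rectangle` returned by something like `PdfReader.getPageSize(page)` or `PdfReader.getCropBox(page)`, depending on which box the image export uses.

```java
// Sketch only: PagePercent and its parameter names are hypothetical, not iText API.
final class PagePercent {
    private PagePercent() {}

    /**
     * Percent distance from the left edge of the page box.
     * boxLeft is the x coordinate of the box's lower-left corner (assumed 0 for many PDFs).
     */
    static double leftPercent(double x, double boxLeft, double boxWidth) {
        return (x - boxLeft) / boxWidth * 100.0;
    }

    /**
     * Percent distance from the top edge of the page box.
     * PDF y coordinates grow upward, so the distance is measured down from the box's top.
     */
    static double topPercent(double y, double boxBottom, double boxHeight) {
        return (boxBottom + boxHeight - y) / boxHeight * 100.0;
    }
}
```

With the numbers above, `PagePercent.leftPercent(152, 0, 612)` gives about 24.8 and `PagePercent.topPercent(594, 0, 792)` gives 25, which can be used directly as the `left` and `top` percentages of the overlay div.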

Michael M
  • I upvoted your answer, but I want to clarify that your solution is only correct if the lower-left corner of the page has the coordinate 0, 0 (something that should be assumed for all PDFs). – Bruno Lowagie Aug 06 '13 at 06:38
  • Furthermore, you should check which page area is used to *export the PDF page as an image*: Is it the media box? Or the crop box? Or some other area? Depending on that, the coordinates of the lower left of the image and its height and width may vary. – mkl Aug 06 '13 at 07:14
  • @mkl Not sure what you mean. Each page should be a separate image, exported with the same dimensions as a near perfect representation of how the document would look. I record the coordinates for each page so that I can present each image file separately and then apply the coordinates from my extrapolation in context to just the page being shown. It really does not matter if the height and width vary as long as the dimensions are the same because x% from top and y% from left translate to any size because it's relative. Am I missing something? Thanks! – Michael M Aug 06 '13 at 21:52
  • @BrunoLowagie. Thanks for checking my answer and bringing up that point. I am using docx4j to generate my PDFs. It uses [XSL-FO and ApacheFOP](http://webapp.docx4java.org/OnlineDemo/docx_to_pdf_fop.html). From my understanding, **although I am not 100% sure**, XSL-FO ["does not definitively describe the layout of the text on various pages"](http://en.wikipedia.org/wiki/XSL_Formatting_Objects), but when used with Apache FOP's PDFDocumentHandler to produce PDFs, it considers that ["PDF uses the lower left as origin"](http://bit.ly/11KFVj4), so I believe that will always be the case, yes? – Michael M Aug 06 '13 at 22:44
  • Look here: http://stackoverflow.com/questions/13236370/pdf-bleed-detection/13240546#13240546 – mkl Aug 07 '13 at 03:29
  • Note: The LocationTextExtractionStrategy parser does not locate text in the order in which it appears in the document. I have been putting text into footers (.docx files) and then converting them to PDF (with DOCX4J). I've found that the parser finds the text from what was the .docx file's footer first, then the text in the body section. – Michael M Jun 18 '14 at 23:49