
I converted a PDF to XML using pdftohtml from the Poppler utils. This gives the co-ordinates for the text in the PDF. I also converted PDF to image, using the convert tool from ImageMagick. When I search for the same coordinate in the image I do not find the text pointed to by the XML:

The first link shows the text marked "BILL TO" at top=182. The second link shows the same text "BILL TO", but at different coordinates.

My question is: how do I find the relation between the coordinates from both XML and image format?

Any help would be appreciated.

Ri4a
Bandi
  • HOW did you convert to xml? HOW did you convert to image? You will most likely need a coordinate transformation, the details of which depend on the complete answers to those clarification requests. – mkl Oct 15 '18 at 06:51
  • To convert the PDF to XML I used the command line "pdftohtml -c -hidden -xml", for example: "pdftohtml -c -hidden -xml 8140.pdf yu.xml". For PDF to image I used "convert [option]", for example: "convert xxx.pdf xxx.jpg". I tried different quality settings for the PDF-to-image conversion, but none solved my case. – Bandi Oct 15 '18 at 10:21
  • I google'd around a bit but could not find any `pdftohtml` documentation exhaustive enough to describe the coordinate system used in its outputs. But this is the information needed. Thus, as soon as you discover and share this information, we'll be able to formulate the correct transformation. – mkl Oct 15 '18 at 12:19
  • How do you know the coordinates are "different"? Maybe they are just at a different DPI, or the y-axis is flipped. Also, you may want to describe why you are doing this in the first place. There may be a much simpler way to accomplish what you want. – Ryan Oct 21 '18 at 16:04

1 Answer


Using pdftohtml with the -xml option generates an XML file with a <page> element for each page in the PDF. That element has the attributes width and height. All <text> elements inside the <page> element have the attributes left, top, width and height, measured in the same units as the page size.
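To check what the XML actually contains, it can help to read the boxes out programmatically. The following is a minimal sketch, assuming the output uses page and text elements with integer left, top, width and height attributes; the sample document and the function name extract_text_boxes are made up for illustration and are not real pdftohtml output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that imitates the shape of `pdftohtml -xml` output.
# Real files usually contain more elements and attributes than this.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="182" left="54" width="60" height="13" font="0">BILL TO</text>
</page>
</pdf2xml>"""


def extract_text_boxes(xml_string):
    """Return one dict per text element, with its box and its page size."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        page_size = (int(page.get("width")), int(page.get("height")))
        for text in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "page_size": page_size,
                "left": int(text.get("left")),
                "top": int(text.get("top")),
                "width": int(text.get("width")),
                "height": int(text.get("height")),
                # itertext() also collects text nested in child tags such as <b>
                "content": "".join(text.itertext()),
            })
    return boxes


if __name__ == "__main__":
    for box in extract_text_boxes(SAMPLE_XML):
        print(box)
```

Comparing page_size with the pixel size of the rendered image is a quick way to see whether the two outputs are at the same scale.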

An A4 page is 297 mm, or 11.693 inches, tall. At 72 DPI (PDF units are 1/72 inch), this is 842 dots, which is what pdfinfo will report. Unfortunately, pdftohtml has a default zoom of 1.5, so for an A4 page the height becomes 842 × 1.5 = 1263. So you either need to multiply the XML coordinates by 2/3 first, or use the -zoom 1 option.

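ImageMagick convert renders PDFs at 72 DPI by default, so the resulting images use these same 72 DPI coordinates. If you pass a different -density, scale the coordinates by density/72 as well.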
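The arithmetic above can be sketched in a few lines. This assumes that both outputs measure from the top-left corner of the same page area, that the XML was produced with the default zoom of 1.5, and that the image was rendered at the given density; the function name xml_to_image_px is hypothetical.

```python
def xml_to_image_px(value, zoom=1.5, density=72.0):
    """Map a pdftohtml XML coordinate to a pixel coordinate in the image.

    zoom    -- zoom factor pdftohtml was run with (assumed default: 1.5)
    density -- DPI the image was rendered at (assumed default: 72)
    """
    points = value / zoom            # undo the zoom: back to 1/72 inch units
    return points * density / 72.0   # scale to the image resolution


if __name__ == "__main__":
    # Page height from the XML maps back to the 842 dots of an A4 page.
    print(xml_to_image_px(1263))               # 842.0
    # The "BILL TO" line at top=182, for a 72 DPI and a 150 DPI image.
    print(xml_to_image_px(182))
    print(xml_to_image_px(182, density=150))
```

If the image was cropped or the PDF has an unusual page box, an additional offset may be needed, which this sketch does not cover.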

Ri4a