
As a new PDFBox user, I plan to extract data from a table. Tables with special formats, for example merged column headers, have to be processed with the help of the table's borderlines. Therefore I need to extract the coordinates of the text and at least those of the table's horizontal borderlines.

To extract the text from the table, I used PDFTextStripper to get the list of TextPosition objects. To extract the horizontal lines from the same page, I used PDFGraphicsStreamEngine to collect the stroked GeneralPath objects; each stroked GeneralPath has a corresponding Rectangle2D that represents the line (height = 0). However, the coordinates of the TextPosition objects and those of the GeneralPath objects do not seem to be in the same coordinate system: their Y-axes appear to point in different directions.

According to my investigation, the origin for TextPosition is the top-left corner of the page, whereas the origin for Rectangle2D is the bottom-left corner, and the two Y-axes point in opposite directions.

First, I would like to confirm that my investigation is right. If so, I would like a hint on how to bring the coordinates of Rectangle2D and TextPosition into the same coordinate system.
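To make the question concrete, here is a minimal sketch of the conversion I have in mind. It uses only `java.awt.geom` and assumes an unrotated page. The class `FlipY` and the method `toTopLeft` are hypothetical names for illustration, and the two crop box values are assumed to come from something like `page.getCropBox().getLowerLeftX()` and `page.getCropBox().getUpperRightY()`.

```java
import java.awt.geom.Rectangle2D;

class FlipY {
    /**
     * Converts a rectangle from a bottom-left origin with Y pointing up
     * to a top-left origin with Y pointing down, relative to the crop box.
     * Assumption: the page is not rotated.
     */
    static Rectangle2D toTopLeft(Rectangle2D r, double cropLowerLeftX, double cropUpperRightY) {
        // Shift X so that the left edge of the crop box becomes 0.
        double x = r.getX() - cropLowerLeftX;
        // The top edge of the rectangle becomes its new Y, measured downwards
        // from the top edge of the crop box.
        double y = cropUpperRightY - r.getMaxY();
        return new Rectangle2D.Double(x, y, r.getWidth(), r.getHeight());
    }

    public static void main(String[] args) {
        // A horizontal line (height = 0) on a 612 x 792 page, 700 units above the bottom.
        Rectangle2D line = new Rectangle2D.Double(72, 700, 450, 0);
        Rectangle2D flipped = toTopLeft(line, 0, 792);
        System.out.println(flipped.getX() + ", " + flipped.getY());
    }
}
```

If this is right, the flipped Y of a line could be compared directly with the Y of the TextPosition objects above and below it.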

Thanks in advance

Rui
  • Somebody other than me should answer that one, this is more an "explain" than a programming question... Maybe it is already answered somewhere. Your observation in the second to last paragraph is right. See the DrawPrintTextLocations example. IIRC, the y coordinate is subtracted from the height of the cropbox of the page. – Tilman Hausherr Aug 15 '16 at 20:01
  • Wow, it seems that you already gave the answer here :D I mistakenly used the ArtBox instead of the CropBox, but fortunately the ArtBox section of the API docs says that its default is just the CropBox :) – Rui Aug 16 '16 at 07:37
  • I looked at [DrawPrintTextLocations](https://github.com/apache/pdfbox/blob/2.0.2/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java). It seems that the transformation is implemented with only one line of code, i.e. line 136: `GeneralPath p = r.transform(Matrix.getTranslateInstance(-cropBox.getLowerLeftX(), cropBox.getLowerLeftY()));` I would like to confirm this. – Rui Aug 16 '16 at 10:09
  • No, that one is to adjust the bead coordinates. Beads have nothing to do with fonts; they are sets of rectangles that identify one article. Because the image is the cropbox, the coordinates must be adjusted as well. – Tilman Hausherr Aug 16 '16 at 10:14
  • Also, besides the origin and the direction of the Y-axis, do the `scale` of `TextPosition` and the `scale` of the extracted `Rectangle2D` (line) matter after the transformation? – Rui Aug 16 '16 at 10:30
  • No, it doesn't matter. – Tilman Hausherr Aug 16 '16 at 10:40
  • It seems that the AffineTransform object can help. I would like to confirm :) – Rui Aug 16 '16 at 10:43
  • Sorry, no idea what you mean by "the AffineTransform object", there are many (although flipAT might be for you). The best approach is "learning by doing", i.e. implement your stuff and see if it works. If it doesn't, compare it with the example by running the example on your own PDF. – Tilman Hausherr Aug 16 '16 at 11:17
  • Btw, I suggest you delete this question (as the answer was already in the question), or answer it yourself. This is to avoid too many "orphans". – Tilman Hausherr Aug 16 '16 at 11:19
  • I will delete it once I get the code done and no problems occur – Rui Aug 16 '16 at 16:54
  • @Rui, A long time ago, certainly, but I've also got a use-case where the horizontal lines are needed to properly extract the data out of a table. – Dale Feb 11 '23 at 01:43
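
For later readers, the AffineTransform idea from the comments could look roughly like the sketch below. It is not the code from the PDFBox example, only an illustration with plain `java.awt.geom` classes for an unrotated page. `FlipPath` and `flipTransform` are hypothetical names, and the crop box values are assumed to be read from the page's crop box.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

class FlipPath {
    /**
     * Builds a transform that maps a point (x, y) given with a bottom-left
     * origin to (x - cropLowerLeftX, cropUpperRightY - y), i.e. a top-left
     * origin with Y pointing down. Assumption: the page is not rotated.
     */
    static AffineTransform flipTransform(double cropLowerLeftX, double cropUpperRightY) {
        AffineTransform at = new AffineTransform();
        // Calls are concatenated, so the scale below is applied to points
        // first and this translation second.
        at.translate(-cropLowerLeftX, cropUpperRightY);
        at.scale(1, -1);
        return at;
    }

    public static void main(String[] args) {
        // A stroked horizontal line as it might be collected from the page.
        GeneralPath path = new GeneralPath();
        path.moveTo(72, 700);
        path.lineTo(522, 700);

        AffineTransform at = flipTransform(0, 792);
        Rectangle2D bounds = at.createTransformedShape(path).getBounds2D();
        System.out.println(bounds.getX() + ", " + bounds.getY() + ", " + bounds.getWidth());
    }
}
```

Transforming the whole path rather than only its bounds keeps the approach usable for vertical borderlines and rectangles as well.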

0 Answers