I am new to PDFBox and want to extract data from a table. Tables with special formatting, such as merged column headers, have to be processed with the help of the table's border lines. I therefore need the coordinates of both the text and at least the table's horizontal border lines.
To extract the text from the table, I used PDFTextStripper to get the list of TextPosition objects. To extract the horizontal lines from the same page, I used PDFGraphicsStreamEngine to collect the stroked GeneralPath objects; each stroked GeneralPath object contains a corresponding Rectangle2D object representing the line (height = 0). However, the coordinates of the TextPosition objects and the coordinates of the GeneralPath objects do not seem to be in the same quadrant: their Y-axes appear to point in opposite directions.
According to my investigation, the origin of the TextPosition coordinates is the top-left corner of the page, whereas the origin of the Rectangle2D coordinates is the bottom-left corner, so the two Y-axes run in opposite directions (downward for TextPosition, upward for Rectangle2D).
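To make that concrete, here is a minimal sketch of the conversion I have in mind. It assumes that my reading above is correct, that the page is not rotated, and that the crop box values would come from something like page.getCropBox() in PDFBox. The class CoordinateFlip and the method toTopLeft are hypothetical names of my own, not part of PDFBox, and the sketch deliberately uses only java.awt.geom so it runs on its own.

```java
import java.awt.geom.Rectangle2D;

public class CoordinateFlip {

    /**
     * Converts a rectangle from bottom-left-origin space (Y pointing up)
     * to top-left-origin space (Y pointing down).
     *
     * Assumptions (not confirmed): the page is unrotated, and the two
     * crop box values are the lower-left X and upper-right Y of the page's crop box.
     */
    static Rectangle2D toTopLeft(Rectangle2D r, double cropLowerLeftX, double cropUpperRightY) {
        // Shift X so that the crop box's left edge becomes 0.
        double x = r.getX() - cropLowerLeftX;
        // The top edge in Y-up space is getMaxY(); measure it down from the top of the crop box.
        double y = cropUpperRightY - r.getMaxY();
        return new Rectangle2D.Double(x, y, r.getWidth(), r.getHeight());
    }

    public static void main(String[] args) {
        // A horizontal line (height = 0) at y = 700 on a US Letter page (height 792).
        Rectangle2D line = new Rectangle2D.Double(50, 700, 200, 0);
        Rectangle2D flipped = toTopLeft(line, 0, 792);
        System.out.println(flipped.getY()); // prints 92.0
    }
}
```

If this is right, the flipped Y value of a line could be compared directly with the Y value of a TextPosition on the same page.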
First, I would like to confirm that my understanding is correct. If it is, I would appreciate a hint on how to bring the coordinates of the Rectangle2D and TextPosition objects into the same quadrant.
Thanks in advance