I am using PDFBox and the iTextSharp DLL to process a PDF so that I can get the coordinates of the text within a rectangle. The rectangle coordinates are extracted with iTextSharp, which uses a coordinate system with its origin at the lower left. The page text comes from PDFBox, which uses a coordinate system with its origin at the upper left. I need help converting the coordinates from lower left to upper left.

Updating my question

Pardon me if my question was unclear or did not provide enough information.

Let me try to give more details from the start.

I am working on a tool that receives a PDF in which a rectangle has been drawn with a drawing markup in the comments section. I read the rectangle coordinates using iTextSharp:

PdfDictionary pageDict = pdReader.GetPageN(page_no);
PdfArray annotArray = pageDict.GetAsArray(PdfName.ANNOTS);

where pdReader is a PdfReader.

The page text and its coordinates are extracted using PDFBox. I have created a class, pdfBoxTextExtraction, that processes the text and coordinates and returns the text together with llx, lly, urx, ury line by line (please note: line by line, not sentence by sentence).

So I want to extract the text that lies within the rectangle. I got stuck because the rectangle coordinates returned by iTextSharp (llx, lly, urx, ury) have their origin at the lower left, whereas the text coordinates returned by PDFBox have their origin at the upper left. I then realised that I need to adjust the y-axis so that the origin moves from lower left to upper left. For that I got the media box and the crop box of the page:

iTextSharp.text.Rectangle mediabox = reader.GetPageSize(page_no);
iTextSharp.text.Rectangle cropbox = reader.GetCropBox(page_no);

I then made a basic adjustment:

lly=mediabox.Top - lly

ury=mediabox.Top - ury

In some cases this adjustment worked, whereas some PDFs needed the adjustment against the crop box instead:

lly=cropbox.Top - lly

ury=cropbox.Top - ury

For other PDFs, neither variant worked.

All I need is help in adjusting the rectangle coordinates so that I get the text within the rectangle.

RAHIL KAZI
  • Y' = Ymax - Y. X' = X - Xmin. – mkl Dec 31 '14 at 10:00
  • Hmm, first I was going to say that `X' = X - Xmin` isn't relevant in this context, but it might be if that's how PdfBox "thinks". I'll update my answer once more. – Bruno Lowagie Dec 31 '14 at 10:36
  • As far as I remember, PDFBox uses 0,0 as the upper left for text extraction. I haven't checked, though. – mkl Dec 31 '14 at 11:31
  • Following your comments to @Bruno's answer I'm afraid there is quite some more information required to actually help you along. Please provide some code and sample PDF files to illustrate. – mkl Dec 31 '14 at 12:58
  • Concerning the edit: **a** the PDFs for which the mediabox variant worked... Did the cropbox variant also work for them? **b** the PDFs for which neither worked... Can you share sample documents? – mkl Jan 02 '15 at 08:23
  • Hi @mkl, sorry, but I can't share the sample documents. I think I have found the adjustments for the y-axis and the code is running properly. I am currently testing on various PDFs and will post the adjustments once testing is done. – RAHIL KAZI Jan 02 '15 at 11:01
  • By the way, thank you mkl and @Bruno. Thanks a lot, that cleared up most of the topics regarding PDF. – RAHIL KAZI Jan 02 '15 at 11:04
  • @mkl How do I provide you the sample file? In this PDF, iTextSharp text extraction fails to return the proper text with coordinates. – RAHIL KAZI Jan 07 '15 at 06:43
  • *how do i provide you the sample file* - Share the PDF using e.g. public shares on google drive or dropbox and post the link here. – mkl Jan 07 '15 at 08:41
  • @mkl [Check the pdf here](https://drive.google.com/file/d/0B_YDw6_pdjfdQmpvUjAxVEg5ZDQ/view). Can you please review this PDF? Here the text extraction of the page includes extra spaces between the words. – RAHIL KAZI Jan 07 '15 at 09:25
  • @BrunoLowagie [PDF FILE](https://drive.google.com/file/d/0B_YDw6_pdjfdQmpvUjAxVEg5ZDQ/view). Bruno, can you please help with PDFs like this one, where I am getting extra spaces between the words using iTextSharp's PdfTextExtractor.GetTextFromPage method? – RAHIL KAZI Jan 07 '15 at 09:27
  • That PDF is weird: its page has the boxes `/CropBox[0 0 684 855]/BleedBox[27 27 657 828]/MediaBox[36 36 648 819]/TrimBox[36 36 648 819]`, i.e. a larger crop box than media box. This usually makes no sense and may befuddle your code. – mkl Jan 07 '15 at 11:04
  • Yes @mkl, I have noticed this. So is this the reason that iTextSharp does not extract the text properly, because the PDF is not in the proper format? – RAHIL KAZI Jan 07 '15 at 11:19
  • The PDF content is even weirder... `-.232 Tc [( P)-226.2(r)-231.8(e)-230.8(f)-238(a)-238.9(c)-228.9(e)]TJ` - First setting the character spacing to -(Width of space) and then adding distance again explicitly. I think iText upon seeing those big gaps assumes there to be free space between the characters and presents it as a space. – mkl Jan 07 '15 at 11:20
  • @mkl Can you please suggest anything that could help me with such files? – RAHIL KAZI Jan 07 '15 at 11:25
  • @mkl [the attached images are output generated from itext and pdfbox](https://drive.google.com/folderview?id=0B_YDw6_pdjfdMV9HaVRTc2FjazQ&usp=sharing). PDFBox provides proper output, as it reads single characters and their coordinates. I have created a method that merges the characters into words and then into sentences. – RAHIL KAZI Jan 07 '15 at 11:31
  • @BrunoLowagie The `LocationTextExtractionStrategy` already removes the character spacing at the end of a text chunk for determining the chunk width. Unfortunately it uses `renderInfo.getSingleSpaceWidth()` to determine the space width (used in later comparisons with gap sizes), and that method includes the character spacing. Thus, the assumed space width is 0 here and the smallest gap is considered a space. This can be fixed. – mkl Jan 07 '15 at 12:01
  • @RAHILKAZI *can you please suggest anything that can help me from such cases of files* - unfortunately a (small) improvement of the `LocationTextExtractionStrategy` is necessary, see my comment to Bruno. – mkl Jan 07 '15 at 12:03
  • If @RAHILKAZI is a customer of iText Software, he should post a ticket to our issue tracker in which case, we'll look at it on the very short term. If he's not a paying customer, we'll put it on the TODO list with a low priority because the PDF is not what you could call a "normal PDF". It's more a *garbage in, garbage out* problem than a bug. – Bruno Lowagie Jan 07 '15 at 12:05

2 Answers

The coordinate system in PDF is defined in ISO-32000-1. This ISO standard explains that the X-axis is oriented towards the right, whereas the Y-axis has an upward orientation. This is the default. These are the coordinates that are returned by iText (behind the scenes, iText resolves all CTM transformations).

If you want to transform the coordinates returned by iText so that you get coordinates in a coordinate system where the Y axis has a downward orientation, you could for instance subtract the Y value returned by iText from the Y-coordinate of the top of the page.

An example: Suppose that we are dealing with an A4 page, where the Y coordinate of the bottom is 0 and the Y coordinate of the top is 842. If you have Y coordinates such as y1 = 806 and y2 = 36, then you can do this:

y = 842 - y;

Now y1 = 36 and y2 = 806. You have just reversed the orientation of the Y-axis using nothing more than simple high-school math.
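
For a rectangle rather than a single value, the same subtraction applies to both edges, with one catch: after the flip, the old upper edge becomes the smaller Y value. A minimal sketch of that, where `AxisFlip`, `FlipY`, and `pageTop` are names made up for this illustration and are not part of iTextSharp or PdfBox:

```csharp
using System;

public static class AxisFlip
{
    // lly and ury are the lower and upper Y of a rectangle in bottom-up
    // PDF coordinates; pageTop is the Y coordinate of the top of the page.
    // Returns the same edges measured downward from the top of the page.
    public static (double Top, double Bottom) FlipY(double lly, double ury, double pageTop)
    {
        double top = pageTop - ury;    // old upper edge is now closest to zero
        double bottom = pageTop - lly; // old lower edge is now the larger value
        return (top, bottom);
    }
}
```

With the A4 numbers above, `AxisFlip.FlipY(36, 806, 842)` gives a top of 36 and a bottom of 806.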

Update based on an extra comment:

Each page has a media box. This defines the most important page boundaries. Other page boundaries may be present, but none of them shall exceed the media box (if they do, then your PDF is in violation of ISO-32000-1).

The crop box defines the visible area of the page. By default (for instance if a crop box entry is missing), the crop box coincides with the media box.
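
ISO-32000-1 also says that a crop box extending beyond the media box is effectively reduced to its intersection with the media box. A sketch of picking the box to measure from under those two rules, assuming boxes are passed as plain `{llx, lly, urx, ury}` arrays. The `PageBoxes` class and `EffectiveCropBox` method are hypothetical helpers, and whether a given library already applies the intersection for you is something to check:

```csharp
using System;

public static class PageBoxes
{
    // mediaBox and cropBox are {llx, lly, urx, ury} in default user space.
    // cropBox may be null when the page has no /CropBox entry.
    public static double[] EffectiveCropBox(double[] mediaBox, double[] cropBox)
    {
        if (cropBox == null)
        {
            // Missing crop box: it coincides with the media box.
            return (double[])mediaBox.Clone();
        }

        // Crop box present: clip it to the media box.
        return new[]
        {
            Math.Max(mediaBox[0], cropBox[0]),
            Math.Max(mediaBox[1], cropBox[1]),
            Math.Min(mediaBox[2], cropBox[2]),
            Math.Min(mediaBox[3], cropBox[3]),
        };
    }
}
```

For a page with `/CropBox[0 0 684 855]` and `/MediaBox[36 36 648 819]`, this yields `{36, 36, 648, 819}`, so the top to subtract from would be 819 rather than 855.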

In your comment, you say that you subtract llx from the height. This is incorrect. llx is the lower-left x coordinate, whereas the height is a property measured on the Y axis, unless the page is rotated. Did you check if the page dictionary has a /Rotate value?

You also claim that the values returned by iText do not match the values returned by PdfBox. Note that the values returned by iText conform with the coordinate system as defined by the ISO standard. If PdfBox doesn't follow this standard, you should ask the people from PdfBox why they didn't follow the standard, and what coordinate system they are using instead.

Maybe that's what mkl's comment is about. He wrote:

Y' = Ymax - Y. X' = X - Xmin.

Maybe PdfBox searches for the maximum Y value Ymax and the minimum X value Xmin and then applies the above transformation on all coordinates. This is a useful transformation if you want to render a PDF, but it's unwise to perform such an operation if you want to use the coordinates, for instance to add content at specific positions relative to text on the page (because the transformed coordinates are no longer "PDF" coordinates).
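
For reference, here is that transformation written out for a rectangle, measured against a reference box (for example the crop box). This is only a sketch of the formula quoted above, not a description of what PdfBox does internally. `TopLeftTransform`, `ToTopLeft`, and `ContainsPoint` are hypothetical names, and the tolerance parameter is an assumption added because text positions and drawn rectangles may not line up exactly:

```csharp
using System;

public static class TopLeftTransform
{
    // rect and box are {llx, lly, urx, ury} in PDF user space (Y pointing up).
    // Returns {left, top, right, bottom} measured from the upper-left
    // corner of box, with Y pointing down.
    public static double[] ToTopLeft(double[] rect, double[] box)
    {
        double xMin = box[0];
        double yMax = box[3];
        return new[]
        {
            rect[0] - xMin, // X' = X - Xmin
            yMax - rect[3], // Y' = Ymax - Y, upper edge becomes the top
            rect[2] - xMin,
            yMax - rect[1], // lower edge becomes the bottom
        };
    }

    // True if the point (x, y), given in the same top-left system, lies
    // inside region after growing region by tolerance on every side.
    public static bool ContainsPoint(double[] region, double x, double y, double tolerance)
    {
        return x >= region[0] - tolerance && x <= region[2] + tolerance
            && y >= region[1] - tolerance && y <= region[3] + tolerance;
    }
}
```

A text position would then be tested with something like `TopLeftTransform.ContainsPoint(region, textX, textY, 2.0)`, where the tolerance of 2 units is an arbitrary starting value to tune against real documents.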

Remark:

You say you need PdfBox to get the text of a page. Why do you need this extra tool? iText is perfectly capable of extracting and reordering the text on a page (assuming that you use the correct extraction strategy). If not, please clarify.

Bruno Lowagie
  • Hi Bruno, for some PDFs iTextSharp failed to extract the text and its coordinates, so PDFBox came into the picture. For getting the page height there are two things to consider: 1) the media box and 2) the crop box. I am stuck somewhere between the two. In some PDFs the media box height is the same as the crop box height. And when I subtract the llx from the height to transform the Y coordinate to the top of the page, the coordinates do not match the text coordinates returned by PDFBox. – RAHIL KAZI Dec 31 '14 at 10:20
  • Nevertheless: if the page is rotated, you have to take into account that X may become Y and Y may become X, so your typo led to an important extra caveat. – Bruno Lowagie Dec 31 '14 at 10:52
  • Oops, it's lly, sorry for that mistake. To be more specific, lly is a coordinate of the rectangle. To transform the coordinates so that Y is measured from the top of the page, I adjusted the rectangle coordinates as Mediabox.top - lly - (Mediabox.top - cropbox.top). There are also some instances where iTextSharp returns extra spaces within the text, so the coordinates mismatch; that is why PDFBox is used, as it returns the coordinates properly. But the rectangle coordinates come from iTextSharp, so they have to be transformed so that Y is measured from the top of the page. – RAHIL KAZI Dec 31 '14 at 10:55
  • You cannot expect rectangle positions and text positions to coincide because text positions usually are given on the text baselines but some letters go below the baseline. Thus, your problem is not merely a coordinate transformation but also the need to apply fuzzy checks. – mkl Dec 31 '14 at 11:38
  • @mkl: I don't understand a word of the comment by Rahil Kazi. I'm happy to see that you do. In any case RAHILKAZI: all the information that is returned by iText is correct. If another tool changes the correct information, then you are talking about a *feature*. That *feature* may or may not be available in iText, but before somebody can tell you if it is, you'll have to learn how to phrase your questions because in its current state, I have no clue what you mean by your comment. – Bruno Lowagie Dec 31 '14 at 11:53
  • I don't 100% understand his issue either, but his comment and question seem to indicate that he gets some rectangle data via iTextSharp and some text positions via PDFBox which don't match exactly. One assumption would be that this is due to text positions usually given by their base line start position. It's a guess, a pretty wild one. – mkl Dec 31 '14 at 12:56
  • Apart from the baseline, iText can also provide the ascender and the descender based on info retrieved from the font. Whatever PdfBox provides, it is based on information that is available to iText too. – Bruno Lowagie Dec 31 '14 at 14:39
These are the final adjustments I made:

    if ((mediabox.Top - mediabox.Height) != 0)
    {
        // Media box does not start at y = 0; diffY is its bottom edge.
        topY = mediabox.Top;
        heightY = mediabox.Height;
        diffY = topY - heightY;
        lly_adjust = (topY - ury) + diffY;
        ury_adjust = (topY - lly) + diffY;
    }
    else if ((cropbox.Top - cropbox.Height) != 0)
    {
        // Crop box does not start at y = 0; diffY is the gap between the two box tops.
        topY = mediabox.Top;
        heightY = cropbox.Top;
        diffY = topY - heightY;
        lly_adjust = (topY - ury) - diffY;
        ury_adjust = (topY - lly) - diffY;
    }
    else
    {
        // Both boxes start at y = 0: flip against the media box top.
        lly_adjust = mediabox.Top - ury;
        ury_adjust = mediabox.Top - lly;
    }

RAHIL KAZI
  • Is there a reason why you prefer the media box? I would have thought the crop box would be the main box to look at. – mkl Jan 07 '15 at 11:06