
I have a source PDF (untagged.pdf) from which I want to create a tagged version (tagged.pdf).

I have the HTML tag information for all of the content in the source PDF.

There is a figure on page 3. When I parse the page programmatically, it is not detected as an image, because it is actually a rectangle with some text and another rectangle, like below.

    _____________________         ____________________
   |    Some text inside | ----> |   Some other text  |
   |                     | ----> |            Inside  |
   |_____________________| ----> |____________________|

             Fig 1.x Rectangle 1 to Rectangle 2

Using some other techniques, I have detected that this is a figure and found its bounding coordinates. Let's say the bounding box runs from (10, 30) to (100, 60). I want to tag the whole thing as a figure (like below).

   _____________________________________________________________(100, 60)
  |                                                             |
  |      _____________________         ____________________     |
  |     |    Some text inside | ----> |   Some other text  |    |
  |     |                     | ----> |            Inside  |    |
  |     |_____________________| ----> |____________________|    |
  |                                                             |
  |           Fig 1.x Rectangle 1 to Rectangle 2                |
  |_____________________________________________________________|
  (10, 30)

Now I want to tag this entire section as an image. I have checked libraries like itextpdf and pdfbox. They don't have APIs to tag a figure using coordinates.

In other words, is there any way to programmatically tag an element (a group of images) as a figure?
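To make "tagging by coordinates" concrete: in a tagged PDF, page content is linked to a structure element through marked-content sequences, and a Figure structure element can carry a `BBox` layout attribute of the form `[llx lly urx ury]`. So one possible first step is deciding which content items fall inside the detected rectangle. The following is only a sketch under stated assumptions: it presumes a parser has already produced a bounding box for every content item, and `Box`, `Item`, `split_by_region` and `bbox_attribute` are hypothetical names for illustration, not part of itextpdf or pdfbox. The item coordinates are made up; only the figure rectangle comes from the example above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: (x0, y0) is lower-left, (x1, y1) is upper-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely within this rectangle.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)


@dataclass(frozen=True)
class Item:
    """Hypothetical parsed content item (text run, path, ...) with its bounds."""
    kind: str
    box: Box


def split_by_region(items: List[Item], region: Box) -> Tuple[List[Item], List[Item]]:
    """Separate items that belong to the figure region from the rest."""
    inside = [it for it in items if region.contains(it.box)]
    outside = [it for it in items if not region.contains(it.box)]
    return inside, outside


def bbox_attribute(region: Box) -> List[float]:
    """Values for a Figure's BBox layout attribute: [llx, lly, urx, ury]."""
    return [region.x0, region.y0, region.x1, region.y1]


# Figure rectangle from the question; item boxes below are invented for the demo.
figure = Box(10, 30, 100, 60)
items = [
    Item("path", Box(15, 40, 50, 55)),   # rectangle 1
    Item("text", Box(17, 45, 48, 52)),   # text inside rectangle 1
    Item("path", Box(60, 40, 95, 55)),   # rectangle 2
    Item("text", Box(20, 32, 90, 36)),   # caption
    Item("text", Box(10, 70, 100, 80)),  # unrelated paragraph above the figure
]

inside, outside = split_by_region(items, figure)
```

The items in `inside` are the ones that would need to be wrapped in marked content and attached to a single Figure element, with `bbox_attribute(figure)` supplying its bounding box. How that wrapping is done depends on the PDF library and is not shown here.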

SuperNova
  • Have you checked whether you can *identify* the image section by using something like [pdf2data](https://pdf2data.online/) from iText? You can try it online without any code. Otherwise, I'd suggest you post the PDF file you're working on so that someone can take a look at it. – André Lemos Feb 08 '19 at 08:05
  • I have identified the image bounding box in the PDF. I have to tag it as an image. – SuperNova Feb 08 '19 at 08:19
  • Is it possible for you to provide an example PDF so I can see what you are trying to achieve/tag? If you are comfortable with the PDF structure, you can also check [RUPS](https://itextpdf.com/en/products/rups-reading-and-updating-pdf-syntax) to see how your PDF is structured, and then use a similar approach to the one described in [this post](https://itextpdf.com/en/resources/examples/itext-7/tagged-pdf-adding-alt-structure-tree). – André Lemos Feb 08 '19 at 08:29
  • Thanks for the reply. It is not about that specific PDF or image. I am trying to build a generic solution, wherein I want to tag an element using its coordinates. – SuperNova Feb 08 '19 at 08:41

0 Answers