
We have a project that uses pdf.js to render a PDF into a web page. pdf.js creates HTML container elements for the PDF pages, and the text content of each page is split into HTML `span` elements in the view.

The attached image shows how the PDF text is rendered in the view. It also shows that each span has a `data-key`, and that a span does not necessarily correspond to a line in the PDF.


Now I need a PDF reader for Java that reads the content and breaks it into the same spans, either with their `data-key` or at least as spans in the same order.

There are a lot of Java libraries that read PDF content, but they return the content line by line, which does not solve my issue. I need a Java library that can break the content into chunks equivalent to the spans in the view.

  • Why do you need this to be Java? Can't you just group all the spans with the same value of the `top` CSS style attribute? – SpaceTrucker May 13 '22 at 18:38
  • We work with the PDF in the frontend and store some data. Later we do some analysis on these PDFs in the backend using Java 8, so we need to read the content of the PDF in Java. As you can see, the `span` has a `data-key`, and I need that for the analysis. – Vishwas Anavatti May 13 '22 at 19:40
  • Are your PDFs marked? Or arbitrary? – mkl May 13 '22 at 19:59
  • I am not sure what a marked or arbitrary PDF means, but the PDF is final and cannot be edited, if that answers your question. – Vishwas Anavatti May 13 '22 at 20:49
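
The grouping suggested in the first comment could be sketched in plain Java 8 roughly as follows. This is only an illustration, not a PDF reader: it assumes the span data (`data-key`, `top`, `left`, text) has already been exported from the frontend, and `SpanGrouper`, `Span`, and `groupIntoLines` are hypothetical names made up for this sketch. Spans are bucketed by their rounded `top` value, so two spans that fall on either side of a rounding boundary would end up in different lines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

public class SpanGrouper {

    /** Hypothetical holder for one text-layer span exported from the frontend. */
    public static final class Span {
        final String dataKey;
        final double top;
        final double left;
        final String text;

        public Span(String dataKey, double top, double left, String text) {
            this.dataKey = dataKey;
            this.top = top;
            this.left = left;
            this.text = text;
        }
    }

    /**
     * Groups spans into lines by their `top` value (bucketed by the given tolerance),
     * orders each line by `left`, and joins the span texts with a single space.
     */
    public static List<String> groupIntoLines(List<Span> spans, double tolerance) {
        // TreeMap keeps the buckets sorted from the top of the page downwards.
        TreeMap<Long, List<Span>> byLine = new TreeMap<>();
        for (Span s : spans) {
            long bucket = Math.round(s.top / tolerance);
            byLine.computeIfAbsent(bucket, k -> new ArrayList<>()).add(s);
        }

        List<String> lines = new ArrayList<>();
        for (List<Span> line : byLine.values()) {
            // Left-to-right order within one visual line.
            line.sort(Comparator.comparingDouble(s -> s.left));
            StringBuilder sb = new StringBuilder();
            for (Span s : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(s.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

This goes from spans to lines, which is the opposite direction of what the question asks for, so it only helps if the span data stored by the frontend can be reused in the backend.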
