PDF is not a WYSIWYG format.
Just because you can see a paragraph on the page doesn't mean a computer program can see it.
In fact, an untagged PDF might look like this (pseudo-pdf-code):
go to location 10, 700
set the active font to Times New Roman
set the fontsize to 12
set the color to black
draw the glyph 'H'
go to coordinate 10, 680
draw the glyphs 'Lorem'
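As you can tell from the example, the instructions only position and draw glyphs. Nothing marks where a word, line, or paragraph starts or ends, and the instructions don't even need to draw the text in reading order.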
So the first challenge you're facing is to identify paragraphs.
I worked at iText, and I've talked to various people at Adobe.
Being able to recognize structure in an untagged PDF document is not considered an easy problem.
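To give a feel for the problem, here is a naive sketch of the very first step: turning positioned text chunks back into lines. This is not how iText or Adobe do it. `Chunk`, `group_into_lines`, and the tolerance values are names and numbers I made up for illustration, and I'm assuming you already extracted the position, width, and text of every chunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical extracted text chunk: baseline position, width, and text."""
    x: float
    y: float
    width: float
    text: str


def group_into_lines(chunks, y_tolerance=2.0, space_gap=1.0):
    """Naively rebuild lines of text from chunks drawn in arbitrary order."""
    lines = []
    # PDF y coordinates grow upward, so a higher y means higher on the page.
    for chunk in sorted(chunks, key=lambda c: -c.y):
        # Chunks with (almost) the same baseline are assumed to share a line.
        if lines and abs(lines[-1][0].y - chunk.y) <= y_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])

    result = []
    for line in lines:
        line.sort(key=lambda c: c.x)  # left to right within a line
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            # Only insert a space when there is a visible horizontal gap.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > space_gap else "") + cur.text
        result.append(text)
    return result


# Chunks listed in drawing order, which is not reading order.
chunks = [
    Chunk(45, 680, 30, "ipsum"),
    Chunk(10, 680, 30, "Lorem"),
    Chunk(18, 700, 20, "ello"),
    Chunk(10, 700, 8, "H"),
]
print(group_into_lines(chunks))
```

Even this toy version depends on guessed thresholds, and it falls apart on multi-column layouts, rotated text, tables, superscripts, and so on. That is exactly why the general problem is hard.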
Once you do have this structure (down to the level of 'these glyphs make up a line', 'these lines make up a paragraph', etc.), it's a matter of creating a StructureTree.
But since this use case (re-tagging a PDF) was never thought possible, iText (or any other PDF library, to my knowledge) isn't really designed to let you do this easily.
A tag is part of a separate data structure inside the PDF.
Tags can have children (for instance to indicate 'this paragraph contains these lines').
A tag itself will reference the objects (groups of instructions) that are part of it.
So you might have:
- these instructions (to render a piece of text) make up a word and form an object
- these word objects are aggregated (by a tag) into a line object
- a few line objects are aggregated (by another tag) into a paragraph object
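The nesting above can be sketched as a small tree. This is only a toy model of the idea, not the actual PDF object layout and not any library's API. `MarkedContent` and `Tag` are hypothetical classes. The `mcid` field mirrors the marked-content identifier a real structure element uses to point at a group of instructions, and `P` and `Span` are standard structure types, chosen here just as plausible roles.

```python
from dataclasses import dataclass, field


@dataclass
class MarkedContent:
    """Hypothetical stand-in for a group of drawing instructions on a page."""
    mcid: int  # marked-content identifier the tag uses as a reference
    text: str


@dataclass
class Tag:
    """Hypothetical structure element: a role plus child tags or content."""
    role: str
    children: list = field(default_factory=list)

    def text(self):
        # Walk the tree depth-first to recover the text in logical order.
        parts = []
        for child in self.children:
            if isinstance(child, MarkedContent):
                parts.append(child.text)
            else:
                parts.append(child.text())
        return " ".join(parts)


# Word-level content is grouped into lines, and lines into a paragraph.
line1 = Tag("Span", [MarkedContent(0, "Lorem"), MarkedContent(1, "ipsum")])
line2 = Tag("Span", [MarkedContent(2, "dolor")])
paragraph = Tag("P", [line1, line2])

print(paragraph.text())
```

The point is that the reading order comes from the tree, not from the order in which the instructions appear in the page content.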
For a thorough understanding, I recommend reading the PDF spec (ISO 32000), in particular the sections on logical structure and tagged PDF.