Funnily enough, requirement no. 1 is actually the hardest.
In order to understand why, you need to understand PDF a bit.
PDF is not a structured document format in the way HTML is. If you open a PDF file in Notepad (or Notepad++), you'll see that it doesn't seem to contain any human-readable information (the page content is usually compressed as well).
In fact, a PDF contains instructions that tell a viewer program (like Adobe Reader) how to render each page.
So instead of having an actual table in there (like you might expect in an HTML document), it will contain stuff like:
- draw a line from .. to ..
- go to position ..
- draw the characters '123'
- set the font to Helvetica bold
- go to position ..
- draw a line from .. to ..
- draw the characters '456'
- etc
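To make that concrete, here is a minimal sketch of what such instructions look like once decompressed. The operators (m, l, S, BT, ET, Tf, Td, Tj) are real PDF content-stream operators, but the fragment itself is hand-written for illustration, the font resource name /F1 and the class name ContentStreamSketch are assumptions, and the tokenizer is deliberately naive (it splits on whitespace, so it would break on strings containing spaces).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ContentStreamSketch {
    // Hand-written, uncompressed fragment in content-stream syntax (illustrative only).
    static final String FRAGMENT =
        "50 700 m 300 700 l S\n" +
        "BT /F1 12 Tf 60 685 Td (123) Tj ET\n" +
        "50 680 m 300 680 l S\n";

    // The operators used above, with a plain-English meaning.
    static final Map<String, String> OPERATORS = Map.of(
        "m", "move to", "l", "line to", "S", "stroke path",
        "BT", "begin text", "ET", "end text",
        "Tf", "set font and size", "Td", "move text position", "Tj", "show text");

    // Operands come first, the operator comes last (postfix notation).
    static List<String> describe(String stream) {
        List<String> out = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : stream.trim().split("\\s+")) {
            String meaning = OPERATORS.get(token);
            if (meaning == null) {
                operands.add(token);            // not an operator: collect as operand
            } else {
                out.add(meaning + " " + operands);
                operands.clear();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        describe(FRAGMENT).forEach(System.out::println);
    }
}
```

Notice that nothing in the fragment says "table", "row" or "cell". There are only lines and pieces of text at coordinates.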
See also "How does TextRenderInfo work in iTextSharp?"
In order to extract the table from the PDF, you need to do several things.
- implement IEventListener (this is an interface; you attach your implementation to a parser instance, the parser goes over the entire page and notifies all listeners of events such as TextRenderInfo, ImageRenderInfo and PathRenderInfo)
- watch out for PathRenderInfo events
- build a data structure that tracks which paths are being drawn
- as soon as you detect a cluster of lines that is at roughly 90° angles, you can assume a table is being drawn
- determine the smallest bounding box that encloses the cluster of lines (this is related to the convex hull problem, and one algorithm to solve that is the gift wrapping algorithm)
- now you have a rectangle that tells you where (on the page) the table is located.
- you can now recursively apply the same logic within the table to determine rows and columns
- you can also keep track of TextRenderInfo events, and sort them into bins depending on the rectangles that fit each individual cell of the table
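The geometric part of these steps can be sketched roughly as follows. This is not iText code and not how pdf2Data does it: it assumes you have already collected the stroked line segments and the text positions from your listener, and it only handles the easy case of a single, fully ruled table without merged cells. The names TableGridSketch, Segment and TOLERANCE are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class TableGridSketch {
    static final double TOLERANCE = 1.0; // assumed slack, in page units

    /** A stroked line segment, e.g. collected from path events. */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean isHorizontal() { return Math.abs(y1 - y2) <= TOLERANCE; }
        boolean isVertical()   { return Math.abs(x1 - x2) <= TOLERANCE; }
    }

    /** Smallest axis-aligned box {minX, minY, maxX, maxY} enclosing all segments. */
    static double[] boundingBox(List<Segment> segments) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Segment s : segments) {
            minX = Math.min(minX, Math.min(s.x1, s.x2));
            minY = Math.min(minY, Math.min(s.y1, s.y2));
            maxX = Math.max(maxX, Math.max(s.x1, s.x2));
            maxY = Math.max(maxY, Math.max(s.y1, s.y2));
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    /** Sorted grid-line positions: y of horizontal lines, or x of vertical lines. */
    static List<Double> gridLines(List<Segment> segments, boolean horizontal) {
        TreeSet<Double> sorted = new TreeSet<>();
        for (Segment s : segments) {
            if (horizontal && s.isHorizontal()) sorted.add(s.y1);
            if (!horizontal && s.isVertical()) sorted.add(s.x1);
        }
        List<Double> merged = new ArrayList<>();
        for (double v : sorted) { // merge positions closer together than TOLERANCE
            if (merged.isEmpty() || v - merged.get(merged.size() - 1) > TOLERANCE) {
                merged.add(v);
            }
        }
        return merged;
    }

    /** Index i such that lines[i] <= v < lines[i+1], or -1 if v is outside the grid. */
    static int bin(List<Double> lines, double v) {
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (v >= lines.get(i) && v < lines.get(i + 1)) return i;
        }
        return -1;
    }

    /** {row, col} of the cell containing (x, y), row 0 at the top; null if outside. */
    static int[] cellOf(List<Segment> segments, double x, double y) {
        List<Double> ys = gridLines(segments, true);
        List<Double> xs = gridLines(segments, false);
        int col = bin(xs, x);
        int fromBottom = bin(ys, y); // PDF y coordinates grow upwards
        if (col < 0 || fromBottom < 0) return null;
        return new int[] {(ys.size() - 2) - fromBottom, col};
    }

    /** A 2x2 table: three horizontal and three vertical rules. */
    static List<Segment> sampleTable() {
        List<Segment> segs = new ArrayList<>();
        for (double y : new double[] {100, 150, 200}) segs.add(new Segment(50, y, 250, y));
        for (double x : new double[] {50, 150, 250}) segs.add(new Segment(x, 100, x, 200));
        return segs;
    }

    public static void main(String[] args) {
        List<Segment> segs = sampleTable();
        System.out.println(Arrays.toString(boundingBox(segs)));
        System.out.println(Arrays.toString(cellOf(segs, 60, 180)));  // text drawn at (60, 180)
        System.out.println(Arrays.toString(cellOf(segs, 160, 120))); // text drawn at (160, 120)
    }
}
```

Real documents are messier: tables without rules, cells drawn as filled rectangles instead of lines, rotated pages, several tables per page. That is where the actual difficulty lies.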
This is a lot of work. None of this is trivial. In fact, this is the kind of stuff people write PhD theses about.
iText has a good implementation of most of these algorithms in the form of the pdf2Data tool.