PDF Table Structure

Question

I have a PDF file with a tabular structure, but I am not able to store its contents in a database because the text in the PDF is set in the Mangal font.

So I have two problems:

1. Extract the table data from the PDF
2. The text is in the Marathi language

I have managed to do this for English with the following code:

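// Extract the text of page i + 1, ordered by its position on the page.
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, strategy);
text.Append(currentText);
// Raw content stream of the page (the UTF-8 to UTF-8 Encoding.Convert call is a no-op).
string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));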

This gives me the tabular structure, but only for English text. How can I do the same for Marathi?

Is the PDF a Tagged PDF? If not, well, that explains why you can't extract the table data. Did the person creating the document store the information that is necessary to extract the Marathi text correctly inside the PDF the moment he made the ligatures? If not, how do you suppose any software is able to extract the correct bytes in the correct order? — Bruno Lowagie, Nov 20 '17 at 08:32

score 1 · Answer 1 · edited Nov 20 '17 at 14:53

Funnily enough, requirement no. 1 is actually the hardest.

In order to understand why, you need to understand PDF a bit. PDF is not a WYSIWYG format. If you open a PDF file in Notepad (or Notepad++), you'll see that it doesn't seem to contain any human-readable information.

In fact, a PDF contains instructions that tell a viewer program (like Adobe Reader) how to render the page.

So instead of having an actual table in there (like you might expect in an HTML document), it will contain stuff like:

draw a line from .. to ..
go to position ..
draw the characters '123'
set the font to Helvetica bold
go to position ..
draw a line from .. to ..
draw the characters '456'
etc

See also How does TextRenderInfo work in iTextSharp?

In order to extract the table from the PDF, you need to do several things.

implement IEventListener (this is an interface; you attach your implementation to a parser, which goes over the entire page and notifies all listeners of events such as TextRenderInfo, ImageRenderInfo and PathRenderInfo)
watch out for PathRenderInfo events
build a data structure that tracks which paths are being drawn
as soon as you detect a cluster of lines that meet at roughly 90° angles, you can assume a table is being drawn
determine the bounding box that encloses the cluster of lines (this is related to the convex hull problem, and one algorithm to solve that is the gift wrapping algorithm)
now you have a rectangle that tells you where (on the page) the table is located
you can now recursively apply the same logic within the table to determine rows and columns
you can also keep track of TextRenderInfo events, and sort them into bins depending on the rectangles that fit each individual cell of the table
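The last few steps can be sketched in plain C#, independent of any PDF library. This is only a rough illustration under simplifying assumptions: the table is a regular grid, the line segments and text positions have already been collected (for example from PathRenderInfo and TextRenderInfo events), and all coordinates are in the same space. The types Segment and TextChunk and the class TableSketch are hypothetical names made up for this example, not iText types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types: in practice the values would come from
// PathRenderInfo and TextRenderInfo events.
public record Segment(double X1, double Y1, double X2, double Y2);
public record TextChunk(string Text, double X, double Y);

public static class TableSketch
{
    // Sort the coordinates and collapse values closer than tol into one grid line.
    static List<double> MergeClose(IEnumerable<double> values, double tol)
    {
        var result = new List<double>();
        foreach (var v in values.OrderBy(v => v))
            if (result.Count == 0 || v - result[result.Count - 1] > tol)
                result.Add(v);
        return result;
    }

    // Index i such that edges[i] <= v < edges[i + 1], or -1 if v lies outside.
    static int Bin(List<double> edges, double v)
    {
        for (int i = 0; i + 1 < edges.Count; i++)
            if (v >= edges[i] && v < edges[i + 1]) return i;
        return -1;
    }

    public static Dictionary<(int Row, int Col), string> ToCells(
        IEnumerable<Segment> lines, IEnumerable<TextChunk> chunks, double tol = 1.0)
    {
        var segs = lines.ToList();
        // Vertical segments give the column edges, horizontal ones the row edges.
        // The first and last edge in each list also give the table's bounding box.
        var xs = MergeClose(segs.Where(s => Math.Abs(s.X1 - s.X2) <= tol).Select(s => s.X1), tol);
        var ys = MergeClose(segs.Where(s => Math.Abs(s.Y1 - s.Y2) <= tol).Select(s => s.Y1), tol);

        var cells = new Dictionary<(int Row, int Col), string>();
        foreach (var c in chunks)
        {
            int col = Bin(xs, c.X);
            int yBin = Bin(ys, c.Y);
            if (col < 0 || yBin < 0) continue; // text outside the table

            // PDF y coordinates grow upwards, so the top row has the highest y bin.
            int row = ys.Count - 2 - yBin;
            cells.TryGetValue((row, col), out var existing);
            cells[(row, col)] = existing is null ? c.Text : existing + " " + c.Text;
        }
        return cells;
    }
}
```

Real documents are messier than this: cells can span several rows or columns, borders may be drawn as thin filled rectangles instead of lines, and some tables have no borders at all.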

This is a lot of work. None of this is trivial. In fact, this is the kind of stuff people write PhD theses about.

iText has a good implementation of most of these algorithms in the form of the pdf2Data tool.

edited Nov 20 '17 at 14:53

Bruno Lowagie


answered Nov 20 '17 at 14:38

Joris Schellekens


*"Funnily enough, requirement no. 1 is actually the hardest."* - There actually are loads of documents where requirement no. 2 also is hard, effectively forcing one to resort to OCR... – mkl Nov 20 '17 at 17:50
I implemented OCR for iText, and structure recognition for iText. Give me OCR any day :) – Joris Schellekens Nov 21 '17 at 12:06
I have extracted the tabular structure of the PDF, but the problem is this: either I can extract the text without the tabular structure, or I can extract the tabular structure but not the Marathi text. Help me please – user1358401 Nov 28 '17 at 08:42
Like I said in the above answer, PDF does not normally keep an internal state that allows you to extract which piece of text is at which row/column. So you will not be able to extract text in that way. – Joris Schellekens Nov 28 '17 at 13:37
I have done this task for an English PDF but am not able to do it for the Marathi language – user1358401 Dec 01 '17 at 03:56
Then share your code for the English version. It will make this whole discussion a lot easier. – Joris Schellekens Dec 01 '17 at 17:26
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i+1, strategy); text.Append(currentText); string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1))); this encoding gives tabular structure but only for english font, want to know for marathi – user1358401 Dec 04 '17 at 09:06
Do not add code in comments. It doesn't come out as legible. – Joris Schellekens Dec 04 '17 at 14:09
Is OCR for images only, or can we use it for text in a PDF as well? – user1358401 Dec 07 '17 at 10:32
I am getting this type of content in the raw PDF. How do I decode it? <0012> Tj <004B0057005700530056001D0012001200500044004B00440045004B0058004F0048004E004B001100500044004B0044005500440056004B0057005500440011004A00520059> Tj <005C036C009200800221009A> Tj – user1358401 Dec 07 '17 at 10:56
You should use IEventListener for that. It will automatically decode instructions like those and turn them into method calls that tell you "a line is being drawn" or "text is being drawn" or "an image was inserted". That way you don't have to deal with this low-level syntax. – Joris Schellekens Dec 08 '17 at 10:46
And OCR is currently an internal proof of concept. The idea would be to convert scanned PDF documents into "normal" PDF. – Joris Schellekens Dec 08 '17 at 10:48
I am getting marathi text but some words are not showing correctly. Like अधिकार as अ\0धकार महाराष्ट्र as महारा\0\0 क्षेत्र as \0े\0 – user1358401 Dec 14 '17 at 11:26
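
For readers wondering what the raw hex strings quoted in the comments above are: each operand of Tj is a sequence of character codes in the current font, and turning those codes into Unicode text requires the mapping stored with that font (its ToUnicode CMap), if the PDF contains one. The sketch below only illustrates the idea. It assumes two-byte codes, and the toUnicode dictionary is a hypothetical stand-in for the font's real mapping; the class HexStringSketch is made up for this example and is not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HexStringSketch
{
    // Split a hex string operand such as "<004B0057>" into character codes,
    // assuming every code is two bytes (four hex digits) long.
    public static List<int> ToCodes(string hexOperand)
    {
        string hex = hexOperand.Trim('<', '>');
        var codes = new List<int>();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
            codes.Add(Convert.ToInt32(hex.Substring(i, 4), 16));
        return codes;
    }

    // Map the codes to text through a lookup table (hypothetical stand-in for
    // the font's ToUnicode CMap). Codes without an entry become '\0'.
    public static string Decode(string hexOperand, IReadOnlyDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (int code in ToCodes(hexOperand))
            sb.Append(toUnicode.TryGetValue(code, out var s) ? s : "\0");
        return sb.ToString();
    }
}
```

If a font has no usable entry for some glyphs, such as ligatures, the text for those glyphs cannot be recovered this way. That would produce symptoms similar to the ones reported in the last comment, although nothing in this thread confirms that this is the cause.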

score -1 · Answer 2 · edited Jan 04 '18 at 14:56

Code:

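// Extract the text of page i + 1, ordered by its position on the page.
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, strategy);
// Raw content stream of the page (the UTF-8 to UTF-8 Encoding.Convert call is a no-op).
string rawPdfContent = Encoding.UTF8.GetString(Encoding.Convert(Encoding.UTF8, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));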

Then I identified the horizontal and vertical lines in the PDF. In the raw content, lines are drawn with either the re operator or the m and l operators.
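
The answer does not show how the lines were identified. As a rough sketch of one way to do it, the code below scans the raw content string for the simplest forms of these operators, "x y m x y l" and "x y w h re". It makes several assumptions: it ignores coordinate transformations (cm), paths with more than one l after an m, and number-like text inside string operands. The class PathScanner and its method are hypothetical names for this example, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PathScanner
{
    // A PDF number such as 12, -3.5 or .25
    const string Num = @"(-?\d*\.?\d+)";

    static double D(Group g) => double.Parse(g.Value, CultureInfo.InvariantCulture);

    // Returns segments as (x1, y1, x2, y2) in the content stream's own coordinates.
    public static List<(double X1, double Y1, double X2, double Y2)> FindSegments(string content)
    {
        var segments = new List<(double X1, double Y1, double X2, double Y2)>();

        // "x y m x y l": move to a point, then draw a straight line to another point.
        foreach (Match m in Regex.Matches(content, $@"{Num}\s+{Num}\s+m\s+{Num}\s+{Num}\s+l\b"))
            segments.Add((D(m.Groups[1]), D(m.Groups[2]), D(m.Groups[3]), D(m.Groups[4])));

        // "x y w h re": a rectangle, which contributes four edges.
        foreach (Match m in Regex.Matches(content, $@"{Num}\s+{Num}\s+{Num}\s+{Num}\s+re\b"))
        {
            double x = D(m.Groups[1]), y = D(m.Groups[2]);
            double w = D(m.Groups[3]), h = D(m.Groups[4]);
            segments.Add((x, y, x + w, y));         // bottom edge
            segments.Add((x + w, y, x + w, y + h)); // right edge
            segments.Add((x, y + h, x + w, y + h)); // top edge
            segments.Add((x, y, x, y + h));         // left edge
        }
        return segments;
    }
}
```

Listening for PathRenderInfo events, as suggested in the other answer, avoids this kind of low-level parsing and takes the transformation matrix into account.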

Then I worked on the Marathi text that I got from iTextSharp.

Then I combined the two: for the desired location, I extract the text using this code:

// Width and height of the target region, taken from the detected vertical lines.
Int64 width = Convert.ToInt64(linesVertical[5].StartPoint.X) - Convert.ToInt64(linesVertical[2].StartPoint.X);
Int64 height = Convert.ToInt64(linesVertical[2].EndPoint.Y) - Convert.ToInt64(linesVertical[2].StartPoint.Y);
// The constants 800 and 150 appear to be offsets specific to this page layout.
System.util.RectangleJ rect = new System.util.RectangleJ(Convert.ToInt64(linesVertical[2].StartPoint.X), (800 - Convert.ToInt64(linesVertical[2].EndPoint.Y) + 150), width, height);
// Only keep text that falls inside the rectangle.
RenderFilter[] renderFilter = new RenderFilter[1];
renderFilter[0] = new RegionTextRenderFilter(rect);
ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
Owner_Name = PdfTextExtractor.GetTextFromPage(reader, 1, textExtractionStrategy);
