4

I need to write a program that can analyse and understand the contextual and semantic relations in the content of PDFs that share a particular structure and format.

Below is a sample showing a piece of content from such a PDF: [image: sample PDF content]

So I need a PDF reading library that can extract not only the text but also the metadata in the PDF, such as font size, font style (bold, italic), background color, tables and their child elements, table cell background color, check boxes, element positions and so on.

Is there a free .NET library that can do the job? Thanks so much.

PS: I am aware of this post: Extract Data from .PDF files, but the libraries' capabilities are not elaborated there.

VincentZHANG
  • From what little I know about PDF, I would be very surprised if the files contained high-level concepts like "table cell." – adv12 Nov 18 '16 at 03:01
  • The file's information is enough to tell the PDF reader to render some cells without a background, and some in light green. That means there is at least something to indicate that. – VincentZHANG Nov 18 '16 at 03:18
  • Sure, there's at least a filled rectangle and an unfilled rectangle, but I wouldn't expect anything with more semantic meaning than that--that is, nothing to say, "This rectangle represents the header of the first column of a table." I've used various libraries to write out PDFs and have never found such high level concepts to be part of the APIs. – adv12 Nov 18 '16 at 03:22
  • I get your point. I just used the word 'table' for that notation; it does not have to be the same kind of table as in HTML, and I think a filled/unfilled rectangle is informative enough for me to distinguish the meaning of the text. I may have used the wrong words, but that doesn't matter much. – VincentZHANG Nov 18 '16 at 03:40
  • Again, I would be very surprised if there were any notion of the filled rectangle "containing" text. Yes, there will be a filled rectangle and a block of text. But I would be surprised if there were any record of a relationship between the two. – adv12 Nov 18 '16 at 03:44
  • So you mean there is no recorded relationship between a container and its content, right? If so, is the position of those elements on the page available? – VincentZHANG Nov 18 '16 at 03:46
  • Yes. You could find an object on the page and look for text within its bounding box. – adv12 Nov 18 '16 at 03:53
  • @adv12 Could you give me some pointers on how to get the check boxes and the font size of a piece of text? Thanks so much. – VincentZHANG Nov 19 '16 at 03:44
  • Nope, I don't know how to do that. Good luck! – adv12 Nov 19 '16 at 03:45
  • iTextSharp is the way to go. Good samples. Broad set of capabilities. – Glenn Ferrie Nov 25 '16 at 05:13

2 Answers

3

I don't have a quick answer, but I've spent the last two weeks solving this exact problem, with success. I used Apache PDFBox, which extracts PDF text as TextPosition objects. Each TextPosition carries information about one character of the text (position, bold, italic, font, etc.). I used this information to set up bounding boxes for all of the table elements, decipher things like text alignment and column membership, and then recreate the PDF page and its tables in Excel, all in just under 1000 lines of code.
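
To give a rough idea of what "column membership" can mean in practice, here is a minimal sketch and not the code from this answer. It assumes each character has already been reduced to a left edge and a width; CharBox, ColumnGrouper and the minGap threshold are hypothetical names introduced only for illustration. It projects all characters onto the x axis and starts a new column wherever the horizontal gap exceeds the threshold.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-character record; not a PDFBox type.
public sealed class CharBox
{
    public char Char;
    public float X;      // left edge of the character
    public float Width;  // horizontal extent of the character
}

public static class ColumnGrouper
{
    // Sorts characters by left edge and starts a new column whenever the gap
    // between the running right edge and the next left edge exceeds minGap.
    public static List<List<CharBox>> GroupIntoColumns(IEnumerable<CharBox> chars, float minGap)
    {
        var columns = new List<List<CharBox>>();
        float rightEdge = float.NegativeInfinity;

        foreach (var c in chars.OrderBy(ch => ch.X))
        {
            if (columns.Count == 0 || c.X - rightEdge > minGap)
            {
                columns.Add(new List<CharBox>());
            }
            columns[columns.Count - 1].Add(c);
            rightEdge = Math.Max(rightEdge, c.X + c.Width);
        }
        return columns;
    }
}
```

A sensible minGap depends on the document; a multiple of the width of a space character in the current font is one possible starting point. Characters from different rows that overlap horizontally end up in the same column, which is the intended effect for a table.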

I did not have to extract graphic elements like checkboxes, but Apache PDFBox does extract to COSStreams, and graphic and form elements can likely be parsed from those streams - I'm not there yet. My code would be able to rebuild the table you showed and would only be missing the checkboxes and background colors.

I've searched for a simpler solution than mine and came up short. It seems there's no easy way to do this.

EDIT: If this hasn't dissuaded you, I can show you how to begin. First, extend either PDFTextStripper or PDFTextStripperByArea. This gives you access to the TextPositions via the processTextPosition override. The following code shows how I transformed TextPositions into my own custom class, TextChar. I then use the relative positions of the characters to work out rudimentary contextual information:

// Assumes PDFBox 1.8.x compiled for .NET with IKVM; namespaces may differ in other versions.
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using org.apache.pdfbox.pdmodel.graphics;
using org.apache.pdfbox.util;

public class PDFStripper : PDFTextStripper
{
    // One list of characters per page. Page numbers start at 1, so index 0 stays unused.
    private List<TextChar>[] tcPages;

    public PDFStripper(java.util.List pages)
    {
        int pagecount = pages.size();
        tcPages = new List<TextChar>[pagecount + 1];
        base.processPages(pages);
    }

    // Called by PDFBox once for every character it encounters on a page.
    protected override void processTextPosition(TextPosition tp)
    {
        PDGraphicsState gs = getGraphicsState();
        TextChar tc = BuildTextChar(tp, gs);
        int currentPageNo = getCurrentPageNo();
        if (tcPages.ElementAtOrDefault(currentPageNo) == null)
        {
            tcPages[currentPageNo] = new List<TextChar>();
        }
        tcPages[currentPageNo].Add(tc);
    }

    // TextChar is the custom class mentioned above; GetBits, findBold and
    // findItalics are helper methods not shown here.
    private static TextChar BuildTextChar(TextPosition tp, PDGraphicsState gstate)
    {
        TextChar tc = new TextChar();
        tc.Char = tp.getCharacter()[0];

        float h = (float)Math.Floor(tp.getHeightDir());
        tc.Box = new RectangleF
        (
            tp.getXDirAdj(),
            (float)Math.Round(tp.getYDirAdj(), 0, MidpointRounding.ToEven) - h, // adjusted Y to top
            tp.getWidthDirAdj(),
            h
        );

        tc.Direction = tp.getDir();
        tc.SpaceWidth = tp.getWidthOfSpace();

        tc.Font = tp.getFont().getBaseFont();
        tc.FontSize = tp.getFontSizeInPt();

        try
        {
            int[] flags = GetBits(tp.getFont().getFontDescriptor().getFlags());
            tc.IsBold = findBold(tp, flags, gstate);
            tc.IsItalic = findItalics(tp, flags);
        }
        catch { } // style detection is best effort; on failure IsBold/IsItalic keep their defaults

        return tc;
    }

    protected override void writePage() { return; } // prevents exception
}
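
The snippet above calls GetBits, findBold and findItalics without showing them, so here is one way such checks could look. This is a guess at the general idea, not the original helpers: FontStyleGuess, LooksItalic and LooksBold are hypothetical names, and the font-name test is only a heuristic. The flag values come from the font descriptor flags in the PDF specification (Italic is bit 7, ForceBold is bit 19). Unlike the original findBold, this sketch ignores the graphics state.

```csharp
using System;

public static class FontStyleGuess
{
    // Font descriptor flag bits as defined in the PDF specification.
    private const int ItalicFlag = 1 << 6;      // bit 7
    private const int ForceBoldFlag = 1 << 18;  // bit 19

    // True if the Italic flag is set or the font name suggests an italic face.
    public static bool LooksItalic(int flags, string fontName)
    {
        if ((flags & ItalicFlag) != 0) return true;
        return NameContains(fontName, "Italic") || NameContains(fontName, "Oblique");
    }

    // True if the ForceBold flag is set or the font name suggests a bold face.
    public static bool LooksBold(int flags, string fontName)
    {
        if ((flags & ForceBoldFlag) != 0) return true;
        return NameContains(fontName, "Bold");
    }

    // Case-insensitive substring test that tolerates a missing font name.
    private static bool NameContains(string name, string token)
    {
        return name != null && name.IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

In the stripper this might be called with the values already read there, for example FontStyleGuess.LooksBold(descriptorFlags, tc.Font). Many PDFs do not set these flags reliably, which is why the name check is included; it will also match names such as "Semibold", so treat the result as a hint.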
AndrewBenjamin
1

Add this NuGet package: https://www.nuget.org/packages/TikaOnDotNet/ . It's the .NET version of Apache Tika,

then do this:

var extracted = new TikaOnDotNet.TextExtractor().Extract("file.pdf");
var text = extracted.Text;
var metaData = extracted.Metadata;

Good luck buddy :)

Dina
  • Thank you for your answer, but sorry, that metadata is about the PDF file, not the content. Please see the details of my question: what I need is information about the formatting of the content. – VincentZHANG Nov 18 '16 at 23:06
  • I see mate, I had not read your question thoroughly :). Just to give you an idea for what you need: I would perhaps try calling a command-line tool from my C# app to convert the PDF to HTML, and then work with the HTML result :) – Dina Nov 19 '16 at 08:51