
I want to read some of the content of PDF files. I have only just started, and before going further I want to know what the right approach is. iTextSharp's reader may be helpful here, so I converted the PDF into text using:

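using iTextSharp.text.pdf;        // PdfReader
using iTextSharp.text.pdf.parser; // PdfTextExtractor

public static string pdfText(string path)
{
    PdfReader reader = new PdfReader(path);
    string text = string.Empty;
    // iText page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    return text;
}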

I'm still wondering whether this approach is OK, or whether I should instead convert the PDF into Excel and then read the content I want from there.

Professionals' thoughts would be appreciated.

  • Converting the PDF to Excel is basically the same as what you are doing here. However, you may require OCR in special cases as not all pdfs contain text. I don't know if PdfTextExtractor does exactly that. – Psi Apr 06 '17 at 11:32
  • In my case, all pdf's contain text, I just need some content. –  Apr 06 '17 at 11:36
  • What I mean: Even if they contain text that is readable to the human eye, it may not be represented as real characters in the PDF. However, if you are sure that the PDF contains plain text, I don't see why your approach should not be OK, except that I would recommend a `StringBuilder` to append the text. – Psi Apr 06 '17 at 11:38

1 Answer


With iText, you can also choose a specific strategy for extracting the text. But keep in mind that this is always a heuristic process.

PDF documents essentially contain only the instructions needed to render the document for a viewer. So there is no concept of "text" as such; the content is more like "draw character A at position 420, 890".

For any text extraction to work, it needs to make some guesses about when two characters are close enough together that they should be concatenated, and when they should be kept apart.

Coincidentally, iText does this based on the width of a single space character in the font that is being used.
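To make the idea concrete, here is a minimal sketch of such a heuristic. It is not iText's actual code: the `Chunk` type, the `ChunkJoiner` class, and the "half a space width" threshold are all assumptions made for illustration, and the chunks are assumed to lie on one line, already sorted left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical type for illustration only; this is not an iText class.
public sealed class Chunk
{
    public string Text { get; }
    public double StartX { get; }     // x coordinate where the chunk begins
    public double EndX { get; }       // x coordinate where the chunk ends
    public double SpaceWidth { get; } // width of ' ' in the chunk's font

    public Chunk(string text, double startX, double endX, double spaceWidth)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        SpaceWidth = spaceWidth;
    }
}

public static class ChunkJoiner
{
    // Joins chunks on a single line, inserting a space wherever the
    // horizontal gap exceeds a fraction of the font's space width.
    public static string JoinLine(IReadOnlyList<Chunk> chunks, double gapFactor = 0.5)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chunks.Count; i++)
        {
            if (i > 0)
            {
                double gap = chunks[i].StartX - chunks[i - 1].EndX;
                // Assumed threshold: a gap wider than gapFactor * space width
                // is treated as a word break; anything smaller is kerning.
                if (gap > chunks[i].SpaceWidth * gapFactor)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(chunks[i].Text);
        }
        return sb.ToString();
    }
}
```

With a space width of 4, chunks "Pi" (0 to 10), "lot" (10.5 to 25) and "study" (30 to 55) would be joined as "Pilot study": the 0.5 gap is treated as kerning, the gap of 5 as a word break. A different threshold gives different results for the same document, which is why the process is heuristic.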

Keep in mind there could also be ActualText. This is a sort of text that is hidden in the document and is only used in extraction. It makes it possible to have the document render a character like "œ" (ligature version) that gets extracted as "oe" (non-ligature version).

Depending on your input documents, you might want to look into the different implementations of ITextExtractionStrategy.
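As a rough sketch of how a strategy can be passed in, assuming iTextSharp 5.x (the class name `PdfTextWithStrategy` is made up for illustration, and `LocationTextExtractionStrategy` is just one of the available implementations, not necessarily the best one for your documents):

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, named for illustration only.
public static class PdfTextWithStrategy
{
    public static string Extract(string path)
    {
        var text = new StringBuilder();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A strategy accumulates state while parsing,
                // so a fresh instance is created for every page.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
        return text.ToString();
    }
}
```

Swapping in `SimpleTextExtractionStrategy` should keep the text in the order it appears in the content stream, whereas the location-based strategy orders chunks by their position on the page. Which one reads better depends on how the PDF was produced.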

Joris Schellekens
  • That's not completely true: a PDF _can_ contain plain text (otherwise copy & paste would not work in PDF documents where it is applicable). PDFs can even embed the fonts (fully or partially) used to render the contained text. However, they also _can_, as you describe, contain only pre-rendered text exported as curves. So it is not correct to say that it is always a heuristic process. This is only true for PDF files with text exported as curves. – Psi Apr 06 '17 at 19:50
  • Copy/paste works because the viewer applies the same heuristic in reconstructing the text as I described. Of course, I only described the most rudimentary case. – Joris Schellekens Apr 07 '17 at 07:16
  • I am very baffled by that answer and not sure that you know what you are talking about. A PDF viewer _performs OCR_ to allow for _copy & paste_? Wow... if you are serious about that, better read up on the PDF format. – Psi Apr 07 '17 at 07:18
  • I never mentioned OCR. This is an example of instructions in a PDF document: [", 17.1965, P, -18.7118, i, -9.35592, l, -9.35592, o, -17.2414, t, -9.35636, ", 17.1965, , 250] TJ. This instruction places characters on the canvas, in a font (specified by earlier commands). From this point on, it should be clear that extracting text is not exactly easy. – Joris Schellekens Apr 07 '17 at 08:39
  • That is only _one way_ to include text. – Psi Apr 07 '17 at 08:52
  • Of course, and there are other ways. But most of the common ways rely on the principle of instructing the viewer to place some characters at some location. At which point we may have to start performing heuristics to determine what chunks come together to form words, and where to insert spaces. I never claimed to give an exhaustive list of how to insert text. Rather just the basics. – Joris Schellekens Apr 07 '17 at 08:57
  • Did you do the statistics? Where do you get your information from? As far as I can see, the most common way is to use PDF text blocks or PDF table cells to display text information, unless your text is mainly a design element. – Psi Apr 07 '17 at 09:00
  • Mostly from documents that I've processed while offering support at iText. – Joris Schellekens Apr 07 '17 at 09:06
  • So you did the statistics then. – Psi Apr 07 '17 at 09:09