
I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.

To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.

Kenneth K.

1 Answer


PDF contains several different types of objects, not only vector or raster drawing instructions. Text in particular is represented by text elements. Each of these includes a string of characters that should be drawn at a certain position using a specific font.
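To make that concrete, here is a minimal sketch of what a text element can look like inside a page's content stream, and how a naive extractor might pull the strings out. The stream fragment is made up for illustration, and the regex is a deliberate simplification: it assumes an uncompressed stream and only handles plain `(...) Tj` operands, ignoring escapes, nested parentheses, hex strings and `TJ` arrays.

```python
import re

# Hypothetical, uncompressed content-stream fragment. Real streams are
# usually compressed and use many more operators than shown here.
CONTENT_STREAM = rb"""
BT
/F1 12 Tf
72 712 Td
(Hello, ) Tj
/F2 12 Tf
(world) Tj
ET
"""

# A literal string operand followed by the Tj (show text) operator.
# Simplification: no escapes, nested parentheses, hex strings or TJ arrays.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def extract_shown_strings(stream: bytes) -> list[str]:
    """Return the raw string operand of every simple Tj operator."""
    # latin-1 maps each byte to one character; this is only correct when
    # the font's encoding happens to agree, which is an assumption here.
    return [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(stream)]


pieces = extract_shown_strings(CONTENT_STREAM)
```

Here `BT`/`ET` delimit a text object, `Tf` selects a font and size, `Td` moves the text position, and `Tj` draws a string. Note that "Hello, world" arrives as two separate pieces because the font changes in between.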

Text extraction from PDFs can be a complicated affair because the file format is oriented toward page layout. A text element may be an entire paragraph or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode; they may be encoded in a way that is specific to a particular font.
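As a rough illustration of both problems, the sketch below decodes one word that has been split across two text elements, each using a different font with its own character codes. The lookup tables are hypothetical stand-ins for the kind of code-to-Unicode mapping a font may (or may not) provide; they are not read from a real file.

```python
# Hypothetical per-font tables mapping character codes to Unicode.
# The same byte means different characters in different fonts.
TO_UNICODE = {
    "F1": {0x01: "T", 0x02: "e"},
    "F2": {0x01: "x", 0x02: "t"},
}


def decode_text_element(font: str, codes: bytes) -> str:
    """Decode one text element using the table of the font it is drawn in."""
    table = TO_UNICODE.get(font, {})
    # U+FFFD marks codes for which no mapping is known.
    return "".join(table.get(code, "\ufffd") for code in codes)


# One word, stored as two text elements because the typeface changes.
elements = [("F1", b"\x01\x02"), ("F2", b"\x01\x02")]
word = "".join(decode_text_element(font, codes) for font, codes in elements)

# A font without any mapping yields only replacement characters.
unknown = decode_text_element("F9", b"\x01\x02")
```

The two elements contain identical bytes, yet they decode to different letters. When a font carries no usable mapping at all, the extractor has nothing to go on, which is why some PDFs yield gibberish when text is copied out of them.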

If you are lucky enough to deal with Tagged PDF files, such as PDF/UA or the "a" conformance levels of PDF/A (for example PDF/A-1a), text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.

Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text

Joni
  • So is it safe to say that, because a text element merely tells the rendering engine what to draw and where, this is the reason there is no context when you extract text from a PDF? – Kenneth K. Mar 25 '13 at 19:30
  • You can say that. PDF says "here's a block of text" but it doesn't tell you if it's a paragraph, a title, or a table. This makes extracting pure text from PDF complicated. – Joni Mar 25 '13 at 19:33
  • @Joni, it can get worse than that: you may have a PDF with reduced font information, in which case you cannot even tell which Unicode or ANSI text character belongs to a particular PDF character. It can also get better: you may have a tagged PDF, which may contain paragraph/title/line information. But in a general-purpose app you cannot assume anything. – yms Mar 25 '13 at 19:53
  • Thanks @yms, I'll make a note of that. – Joni Mar 25 '13 at 20:12
  • It might be worth looking at the Text section of the [PDF Reference](http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf) too, if you really want to get deep into how it works and is stored. – Lyndon Armitage Mar 26 '13 at 08:40
  • @LyndonArmitage I did begin to read the Text section of the spec. I was really only trying to confirm something I had been spouting off at the office (regarding a PDF *not* storing text, but rather the instructions for drawing something that would end up resembling text). I have since confirmed that I was mistaken :) When I searched for articles describing how PDFs store text, I didn't find anything that was straight to the point (like Mark Stephens' articles). My initial search for the spec turned up the ISO website and a cost of $250. The answer I sought wasn't that important! – Kenneth K. Mar 26 '13 at 11:35