Questions tagged [pdf]

Portable Document Format (PDF) is an open standard for electronic document exchange maintained by the International Organization for Standardization (ISO). Questions may cover creating, reading, or editing PDFs in any programming language.

The official ISO Specification (ISO 32000-1, a.k.a. 'PDF-1.7') is important as a reference, but it is not exactly written for PDF beginners.

Beginners may start with these two easy-to-read resources:

Questions

Related questions on Stack Overflow generally fall into the following domains:

  • How to convert, produce, or encode a PDF with a particular language or library?
  • Everything else.

The first domain has been covered in depth, and any question you have is likely already answered.

Information Extraction

Extracting text from a PDF may not be possible without resorting to Optical Character Recognition (OCR). Letters can be encoded as font glyphs, line art, vector graphics, or raster images.

PDF files generally contain drawing instructions. There's no such thing as "a table" in most PDF files. There are lines, glyphs, and raster images (and clipping, and color spaces, and so forth). It is all but impossible to determine what is or isn't a table in an arbitrary PDF file.
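As a rough illustration of the point above, the sketch below scans a small, hand-written content stream and counts its path and text operators. The stream, the operator sets, and the function names are assumptions made for this example; real page streams are usually compressed and use many more operators, and the simplified regular expressions ignore escaped or nested parentheses.

```python
import re

# Hypothetical, uncompressed page content: what looks like a one-row
# "table" is only a rectangle, a dividing line, and two text runs.
SAMPLE_STREAM = """
50 700 200 20 re S
150 700 m 150 720 l S
BT /F1 10 Tf 55 706 Td (Name) Tj ET
BT /F1 10 Tf 155 706 Td (Price) Tj ET
"""

# Small subsets of the PDF path and text operators, enough for the sample.
PATH_OPS = {"m", "l", "re", "S", "f", "h", "c"}
TEXT_OPS = {"BT", "ET", "Tf", "Td", "Tj", "TJ", "Tm"}


def classify_operators(stream):
    """Count path and text operators in a simplified content stream."""
    # Drop literal strings first so their contents are not read as operators.
    without_strings = re.sub(r"\([^()]*\)", " ", stream)
    counts = {"path": 0, "text": 0}
    for token in without_strings.split():
        if token in PATH_OPS:
            counts["path"] += 1
        elif token in TEXT_OPS:
            counts["text"] += 1
    return counts


def shown_strings(stream):
    """Return (x, y, text) for each 'x y Td (text) Tj' sequence."""
    pattern = r"([\d.]+)\s+([\d.]+)\s+Td\s*\(([^()]*)\)\s*Tj"
    return [(float(x), float(y), text) for x, y, text in re.findall(pattern, stream)]


print(classify_operators(SAMPLE_STREAM))
print(shown_strings(SAMPLE_STREAM))
```

Nothing in the stream says "table", "row", or "cell". Any table structure has to be inferred from the coordinates of the lines and the text, which is why table extraction is heuristic at best.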

Note that a glyph is not a character. A glyph has an appearance, whereas a character has meaning. Each font in a PDF may or may not map its glyphs to characters.
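The distinction can be sketched with a toy lookup table. The glyph codes and the mapping below are invented for this example; in a real file the mapping, when present, comes from the font's encoding or its ToUnicode CMap.

```python
# Hypothetical glyph codes as a subset font might use them in a content
# stream: the numbers identify shapes, not characters.
GLYPH_CODES = [1, 2, 3, 3, 4]

# Assumed glyph-to-character table, playing the role of a ToUnicode map.
TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}


def decode_glyphs(codes, to_unicode=None):
    """Map glyph codes to characters; unmapped glyphs become U+FFFD."""
    table = to_unicode or {}
    return "".join(table.get(code, "\ufffd") for code in codes)


print(decode_glyphs(GLYPH_CODES, TO_UNICODE))  # mapping available
print(decode_glyphs(GLYPH_CODES))              # no mapping in the font
```

With the table, the codes decode to readable text. Without it, the page still renders correctly because the glyph shapes are intact, but the characters cannot be recovered from the codes alone, and OCR on the rendered page is the usual fallback.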

If at all possible, use the source data to extract information, rather than relying on the PDF. This file format is designed for visual consistency, and very little useful normalized data can be extracted from its contents.

Content

A PDF file is often a combination of vector graphics, text, and bitmap graphics. The basic types of content in a PDF are:

  • text stored as drawing operators in content streams (i.e. not as plain text)
  • vector graphics for illustrations and designs that consist of shapes and lines
  • raster graphics for photographs and other types of image
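To make the first two content types concrete, here is a minimal sketch that writes a one-page PDF by hand using only the Python standard library. The function name, page size, coordinates, and the string shown are all choices made for this example; a raster image is left out for brevity, and real documents are normally produced with a PDF library rather than assembled like this.

```python
def build_minimal_pdf():
    """Return the bytes of a one-page PDF with one text run and one line."""
    content = (
        b"BT /F1 24 Tf 72 720 Td (Hello, PDF) Tj ET\n"  # text: glyphs shown at a position
        b"72 700 m 300 700 l S\n"                        # vector graphics: a stroked line
    )
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf_bytes = build_minimal_pdf()
print(len(pdf_bytes), "bytes")
```

Writing `pdf_bytes` to a file opened in binary mode (for example a hypothetical `hello.pdf`) should give a document that common viewers can open. Note that even the text is expressed as operators inside the page's content stream, next to the line-drawing operators.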

Related Links

For additional information about this file format see:

50972 questions
287 votes, 15 answers

How to search contents of multiple pdf files?

How could I search the contents of PDF files in a directory/subdirectory? I am looking for some command line tools. It seems that grep can't search PDF files.
Jestin Joy
270 votes, 15 answers

Merge PDF files

Is it possible, using Python, to merge separate PDF files? Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure. And I may be pushing my luck, but is it possible to…
Btibert3
242 votes, 8 answers

Is it possible to embed animated GIFs in PDFs?

Is it possible to embed animated GIFs in PDFs? And how might I go about such a thing? are there any dangers I should be aware of? For some more details on why I think it's a good thing and how it helps feel free to see this post. I didn't think it…
Joe
235 votes, 8 answers

How do I convert Word files to PDF programmatically?

I have found several open-source/freeware programs that allow you to convert .doc files to .pdf files, but they're all of the application/printer driver variety, with no SDK attached. I have found several programs that do have an SDK allowing you to…
Shaul Behr
228 votes, 11 answers

How to render a PDF file in Android

Android does not have PDF support in its libraries. Is there any way to render PDF files in the Android applications?
alexleutgoeb
219 votes, 10 answers

How do I convert a PDF document to a preview image in PHP?

What libraries, extensions etc. would be required to render a portion of a PDF document to an image file? Most PHP PDF libraries that I have found center around creating PDF documents, but is there a simple way to render a document to an image…
Mathew Byrne
219 votes, 17 answers

How to get rid of blank pages in PDF exported from SSRS

I have a two-page SSRS report. When I exported it to PDF it was taking 4 pages due to its width, where the 2nd and 4th pages were displaying one of my fields from the table. I tried to set the layout size in report properties as width=18in and…
brijit
209 votes, 19 answers

Force to open "Save As..." popup open at text link click for PDF in HTML

I have some big size PDF catalogs at my website, and I need to link these as download. When I googled, I found such a thing noted below. It should open the "Save As..." popup at link click...
designer-trying-coding
197 votes, 6 answers

vs.

Which is the right/best tag to use in my HTML file when I want to display the Adobe PDF viewer? Right now I'm using the code below, but there are weird side effects (e.g. it seems to steal the starting focus that I've set to another text…
JayhawksFan93
194 votes, 8 answers

How can I extract embedded fonts from a PDF as valid font files?

I'm aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and whether they are embedded or not. Now the problem: given I had PDF files with embedded fonts -- how can I extract those fonts in a way that they are re-usable as…
simplybest55
187 votes, 15 answers

How to extract text from a PDF?

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the…
Budda007
184 votes, 10 answers

What is the smallest possible valid PDF?

Out of simple curiosity, having seen the smallest GIF, what is the smallest possible valid PDF file?
meshy
181 votes, 17 answers

Extract a page from a pdf as a jpeg

In python code, how can I efficiently save a certain page of a PDF as a JPEG file? Use case: I have a Python flask web server where PDFs will be uploaded and JPEGs corresponding to each page are stored. This solution is close, but the problem is…
vishvAs vAsuki
180 votes, 9 answers

How to convert a Markdown file to PDF

I have a Markdown file that I wish to convert to PDF so that I can upload it on Speakerdeck. I am using Pandoc to convert from markdown to PDF. My problem is I can't specify what content should go on what page of the PDF, because Markdown doesn't…
Akshar Raaj
170 votes, 11 answers

PDFtk Server on OS X 10.11

I've been using PDFTK Server on OSX pre 10.11 for over a year without any issues running commands on the command line. After installing OSX 10.11 beta, I can no longer run any PDFTK Server commands on the command line. It does not throw any error,…
Aaron