3

I'm looking to generate PDFs from a Python application. They start relatively simple, but some may become more complex (essentially letter-like documents, but they will later include watermarks, for example).

I've worked in raw PostScript before, and provided I can generate the correct headers etc. and end up with a valid file, I want to avoid complex libraries that may not do entirely what I want, especially when I know PDF/PostScript can do exactly what I need. Some seem to have bit-rotted and are no longer supported (pypdf and PyPDF2). PDF content really isn't that complex.

I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, I see a small binary header that I'm not sure how to generate.

I could generate an EPS and convert it. I'm not overly happy with this, as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and generating one format only to convert it to another seems very silly.

Has anyone done this before? Am I being pedantic by not wanting to use a library?

Jetblackstar
  • I'm sure you can find the specification for PDF documents ... but it's going to be a nightmare to do from scratch ... why are you opposed to using a library? (Of course someone's done it before ... they created a library to do it :P) – Joran Beasley Dec 20 '13 at 18:01
  • My objection is partly dependency hell, depending on the library. I'm developing on Ubuntu 12.04 but will need to move to testing on Windows for deployment, with production on Windows Server 2008. For example, I'm trying out PyX, which uses LaTeX libs heavily and has just required me to grab LaTeX and Type 1 fonts, which have their own long list of dependencies on my machine, summing up to 200 MB extra. It feels very OTT, but I could be misinterpreting the required complexity. (Thanks for the fast response, btw.) – Jetblackstar Dec 20 '13 at 18:23
  • I think ReportLab just works with easy_install (or maybe pip) ... but that's neither here nor there ... just fair warning that doing it from scratch will be its own special kind of hell (although probably an interesting learning experience) – Joran Beasley Dec 20 '13 at 18:26
  • My previous brief look at ReportLab was that it was quite complex, but I'll give it another go. – Jetblackstar Dec 20 '13 at 18:27
  • You may be right ... it's been a long time since I installed mine ... (although I expect it to be much less hellish than trying to write PDF language from scratch) (the link at the bottom of my answer has some examples of PDF docs written from scratch) – Joran Beasley Dec 20 '13 at 18:28
  • See [Introduction to PDF](http://www.gnupdf.org/Introduction_to_PDF) for a simple "Hello World" PDF. Generating that by hand from Python is no different than in any other language, though you might want to use a generic templating language (like [Jinja2](http://jinja.pocoo.org/) or [Mako](http://www.makotemplates.org/)) to make your life easier. – Lukas Graf Dec 20 '13 at 19:21
  • @LukasGraf awesome link!!! holy cow thanks – Joran Beasley Dec 20 '13 at 19:51
  • @JoranBeasley yeah, Jetblackstar is definitely right that the basics of PDF really aren't that complicated. Depending on what features you need, it can be a simple matter of a clever stack of templates, and you can do very efficient, lightweight PDF generation yourself. However, as soon as you need to deal with (compressed) binary data, streams, or many xrefs, it gets *really* complicated, and you really should hand that task off to a library. – Lukas Graf Dec 20 '13 at 20:05

3 Answers

4

As long as you're working in Python 2.7, ReportLab seems to be the best solution out there at the moment. It's quite full-featured and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general, hopefully the learning curve won't be too steep.

MattDMo
4

borrowed from ask.yahoo

A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.

But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number is a byte offset: it tells a reader program where to look in the file to find the start of the cross-reference table, the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
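The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a real parser: the function name `find_startxref`, the 1024-byte tail window, and the `sample` bytes are all assumptions made up for the example, and `sample` is only a trailer-shaped stub (its catalog has no pages), not a document a viewer would display.

```python
import re


def find_startxref(pdf_bytes):
    """Return the byte offset written after the last 'startxref' keyword."""
    # Only the end of the file matters; the 1024-byte window is an assumption.
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref / %%EOF trailer found")
    return int(matches[-1])


# Hypothetical stub with one object, laid out so that the cross-reference
# table ("xref") begins at byte 45.
sample = (
    b"%PDF-1.1\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n"
    b"45\n"
    b"%%EOF\n"
)

offset = find_startxref(sample)
# Slicing at the reported offset should land on the "xref" keyword.
table_start = sample[offset:offset + 4]
```

Here the single table entry `0000000009 00000 n` in turn records that object 1 starts at byte 9, immediately after the 9-byte header line.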

For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.

Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
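Letting a program count the bytes removes most of that difficulty. The following is a rough sketch, using only the Python standard library, of a one-page "Hello World" writer that records each object's offset as it goes. The function name `build_pdf`, the PDF version 1.4, the US Letter page size, and the Helvetica font are choices made for this example and do not come from the quoted text. It assumes short ASCII text and uses no compression or binary streams.

```python
def build_pdf(text):
    """Assemble a one-page PDF as bytes, recording each object's byte offset."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: draw the text in 24 pt type near the top of the page.
    content = ("BT /F1 24 Tf 72 720 Td (%s) Tj ET" % escaped).encode("latin-1")
    stream_header = b"<< /Length %d >>\nstream\n" % len(content)

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        stream_header + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    # Comment line holding four bytes above 127. This is the small "binary
    # header": it hints to transfer tools that the file is binary.
    out += b"%\xe2\xe3\xcf\xd3\n"

    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)  # this is the number that follows "startxref"
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

To try it, write the result in binary mode, for example `open("hello.pdf", "wb").write(build_pdf("Hello World"))` (the filename is arbitrary), and open the file in a PDF viewer. Opening the file in binary mode matters on Windows: text mode would translate line endings and shift every offset.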

If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story. http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

But really, you should probably just use a library.

Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple "Hello World" PDF from scratch.

Community
Joran Beasley
  • If you borrowed some text from somewhere else, you should provide a link to the source as well as the simple attribution. – Hannele Dec 20 '13 at 18:13
  • There ya go :) You are correct, of course ... I meant to ... but then I got stuck trying to find the correct link for the specification – Joran Beasley Dec 20 '13 at 18:15
  • I'm going to give this to Joran because he has given me the benefit of the doubt about trying to do it by hand, and also a link to excellent docs on PDF that I'd not found yet. That said, I'm going to capitulate and try one of the available libraries. Much swearing may ensue as I fight them. However, if I end up using ReportLab I will come back and note it here. Many thanks, both! – Jetblackstar Dec 20 '13 at 18:35
  • Thanks. The "introduction-to-pdf" link does not seem to work anymore, but I guess [this example by Felix Schütt](https://github.com/fschutt/printpdf/wiki/1.1.1-Hello-World-PDF) is similar. – djvg Sep 03 '18 at 10:35
0

I recommend using a library. I spent a lot of time creating pdfme and learned a lot along the way, but it's not something you would do for a single project. If you want to use my library, check the docs here.

Felipe Sierra