49

I cannot decode the data from a stream like this:

    56 0 obj 
    << /Length 1242 /Filter /FlateDecode >>
    stream
    x]êΩnƒ Ñ{ûbÀKq¬æ\âê¢....(whole binary is omitted)
    endstream
    endobj

I tried to isolate the binary content (x]êΩnƒ Ñ{ûbÀKq¬æ\âê¢....) in a file and in a binary string. The decoding function gzinflate($encripted_data) gives me a decoding error, and I think this happens because the encoded content is not "deflated" or something similar.

In the PDF Reference v1.7 (sixth edition), on page 67, I found the /FlateDecode filter described as: "...Decompresses data encoded using the zlib/deflate compression method, reproducing the original text or binary data".

I need a real, raw solution, i.e. a PHP function and/or an algorithm that tells me what to do with this "/FlateDecode" stream.

Thank You!

Ruben Kazumov
  • Do you need this function for selected objects only or for all compressed streams (and all compression schemes)? – Kurt Pfeifle Jul 31 '12 at 01:17
  • Dear Kurt! I'd be glad to know how to deal with all kinds of filters, like ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode and Crypt, but in real life FlateDecode is the one most used in PDF files produced by "print to PDF..." features, and right now I really need to deal with this single filter. – Ruben Kazumov Jul 31 '12 at 01:35
  • You say *'I think it happens because encoded content is not "deflated" or so'*. -- That's why I gave you the hint about `qpdf` in my answer. You can use it (at least) to verify or falsify your own efforts, even if it turns out to not be meeting your direct requirements. Also your `56 0 obj`-object can be anything. If you don't tell from where in the PDF it is referenced as `56 0 R` there is no way to know if it is an ICC profile, a font, an image, some page content or something else... – Kurt Pfeifle Jul 31 '12 at 02:19
  • Dear Kurt! Maybe qpdf is a good solution for tasks like this, but unfortunately qpdf is a "shell" or command-line solution. That is not my case. Please forgive me! Thanks for the hints! – Ruben Kazumov Jul 31 '12 at 02:56

5 Answers

76

Since you didn't say whether you need to access only one decompressed stream or all of them, I'll suggest a simple command-line tool that does it in one go for the complete PDF: Jay Berkenbilt's qpdf.

Example command line:

    qpdf --qdf --object-streams=disable in.pdf out.pdf

out.pdf can then be inspected in a text editor (only embedded ICC profiles, images and fonts could still be binary).

qpdf will also automatically re-order the objects and display the PDF syntax in a normalized way (and tell you in a comment what the original object ID of the decompressed object was).

Should you need to re-compress the file (maybe after editing it), just run this command:

    qpdf out-edited.pdf out-recompressed.pdf

(You may see a warning message saying that the utility attempted to repair a damaged file...)

qpdf is multi-platform and available from SourceForge.

Kurt Pfeifle
  • How can we re-compress pdf file, for example, after modifying a text in uncompressed file? – Kemal Dağ Feb 06 '15 at 14:31
  • Thanks. It re-compresses the original one. But after opening it in Adobe Reader, it raises the following error: "This document enabled extended features in Adobe Reader. The document has been changed since it was created and use of extended features is no longer available. Please contact the author for the original version of this document." There are fillable fields in the PDF form. Is there a way to modify PDF files without Adobe Reader raising the above error? Adobe disables the fillable fields after recompression. – Kemal Dağ Feb 10 '15 at 12:22
  • And after just uncompressing and re-compressing a file, the output file is different from the original input. Shouldn't the two be the same? – Kemal Dağ Feb 10 '15 at 12:27
  • @KemalDağ: Using QPDF to uncompress and re-compress all PDF objects will not restore the original PDF exactly. QPDF is providing *"content preserving"* transformations of a PDF. As I said, un-compressing does also *"re-order the objects"* and *"display the PDF syntax in a normalized way"*. Upon re-compressing it does not restore the original order of the objects (different order does not change the visibly rendered contents of the pages). – Kurt Pfeifle Oct 16 '16 at 19:34
  • @KemalDağ: Unfortunately, Adobe's PDF software uses a few proprietary *"extended features"* for fillable fields (this type of document is called a *PDF form*). Basically, these documents require an Adobe (private) signature key, so they can only be processed by someone who has access to that key. It would be illegal to crack and re-use that key. – Kurt Pfeifle Oct 16 '16 at 19:39
  • Is adding the option `--object-streams=disable` really needed? Wouldn't it be better to keep its default value, "preserve"? – Gras Double Dec 02 '17 at 03:16
  • @GrasDouble: The purpose was to get the PDF source code into a shape which makes it *readable* and *grokkable*, while preserving its internal content. Object streams are one of the reasons why the code is un-grokkable. Hence that option. – Kurt Pfeifle Dec 02 '17 at 10:18
19

    header('Content-Type: text/plain');     // the decoded result is sent as plain text
    $n = "binary_file.bin";                 // file containing the isolated stream bytes
    $f = fopen($n, "rb");                   // open it in binary mode
    $c = fread($f, filesize($n));           // read the whole compressed stream
    fclose($f);
    $u = gzuncompress($c);                  // expects zlib data, which fits the /FlateDecode filter
    if ($u === false) {
        exit("Could not decompress $n");
    }
    $out = fopen("php://output", "wb");     // write to the response body
    fwrite($out, $u);                       // output the decoded data

Jingle bells! Jingle bells!...

gzuncompress() is the solution.
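A likely reason why gzuncompress() works where gzinflate() fails: gzuncompress() expects the zlib format (a 2-byte header, the deflate data and a 4-byte Adler-32 checksum), while gzinflate() expects the bare deflate data without that wrapper. The stream in the question starts with "x", which is consistent with a zlib header. The sketch below shows the difference using Python's zlib module as a stand-in for the two PHP functions; the payload is made up for illustration.

```python
import zlib

payload = b"BT /F1 12 Tf (Hello) Tj ET"  # made-up page content, for illustration only

zlib_stream = zlib.compress(payload)  # zlib format: 2-byte header + deflate data + 4-byte Adler-32
raw_deflate = zlib_stream[2:-4]       # the same deflate data with header and checksum removed

# A zlib header produced with the default window size starts with 0x78, i.e. ASCII "x".
print(zlib_stream[:1])  # b'x'

# Counterpart of PHP's gzuncompress(): expects the zlib wrapper.
assert zlib.decompress(zlib_stream) == payload

# Counterpart of PHP's gzinflate(): expects raw deflate (negative wbits means "no header").
assert zlib.decompress(raw_deflate, -15) == payload

# Handing the wrapped stream to the raw decoder is the mistake from the question.
try:
    zlib.decompress(zlib_stream, -15)
except zlib.error as exc:
    print("raw inflate rejected the zlib-wrapped stream:", exc)
```

So for a /FlateDecode stream the zlib-aware function is the one to reach for; the raw variant would only fit if the header and checksum had been stripped first.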

Ruben Kazumov
8

Long overdue, but someone might find it helpful. In this case, with << /Length 1242 /Filter /FlateDecode >>, all you need to do is pass the isolated binary string (basically everything between "stream" and "endstream") to zlib.decompress:

    import zlib

    # Read the isolated stream bytes; the file name is only an example
    with open("binary_file.bin", "rb") as f:
        stream = f.read()
    data = zlib.decompress(stream)  # here you have your clean decompressed stream

However, if your PDF object has /DecodeParms, things become more complicated: you will need the /Predictor value and the /Columns number. It is better to use PyPDF2 for this.
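To give an idea of what the /Predictor step involves, here is a minimal sketch for the PNG predictors (/Predictor values 10 to 15). It assumes 8-bit components, so a row is `columns * bytes_per_pixel` bytes long, and it assumes every row of the inflated data is prefixed with one filter-type byte. The function name `undo_png_predictor` and its parameters are hypothetical; they are not part of PyPDF2 or any other library, and the TIFF predictor (/Predictor 2) is not covered.

```python
import zlib


def undo_png_predictor(data: bytes, columns: int, bytes_per_pixel: int = 1) -> bytes:
    """Reverse PNG row filters on already-inflated data (hypothetical helper)."""
    row_len = columns * bytes_per_pixel
    out = bytearray()
    prev = bytearray(row_len)  # the row above the first row counts as all zeros
    for start in range(0, len(data), row_len + 1):
        ftype = data[start]  # filter type byte that precedes each row
        row = bytearray(data[start + 1:start + 1 + row_len])
        for i in range(len(row)):
            left = row[i - bytes_per_pixel] if i >= bytes_per_pixel else 0
            up = prev[i]
            up_left = prev[i - bytes_per_pixel] if i >= bytes_per_pixel else 0
            if ftype == 0:    # None
                pred = 0
            elif ftype == 1:  # Sub
                pred = left
            elif ftype == 2:  # Up
                pred = up
            elif ftype == 3:  # Average
                pred = (left + up) // 2
            elif ftype == 4:  # Paeth
                p = left + up - up_left
                pa, pb, pc = abs(p - left), abs(p - up), abs(p - up_left)
                if pa <= pb and pa <= pc:
                    pred = left
                elif pb <= pc:
                    pred = up
                else:
                    pred = up_left
            else:
                raise ValueError(f"unknown PNG filter type {ftype}")
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return bytes(out)


# Round trip with hand-made data: two rows of four bytes, both stored with the "Up" filter (type 2).
raw = bytes([1, 2, 3, 4, 2, 4, 6, 8])
filtered = bytes([2, 1, 2, 3, 4, 2, 1, 2, 3, 4])
stream = zlib.compress(filtered)
assert undo_png_predictor(zlib.decompress(stream), columns=4) == raw
```

In a real file the /Columns value comes from /DecodeParms, and /Colors and /BitsPerComponent decide the actual row length, which is one reason an existing library is the safer route.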

Belial
  • The question is asking for PHP, but this solution suggests using Python. That's not a very good fit. Anyway, this may be obvious to you but not to everyone else: you'll need to pass everything in between `stream` and `endstream` **except** the leading and trailing EOL markers. – IInspectable Jan 14 '17 at 15:05
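The point about EOL markers in the comment above can be sketched in Python roughly as follows. The helper name `extract_stream` is hypothetical. It assumes the whole object is available as bytes, that the first occurrence of `stream` is the keyword itself, and that a direct /Length value (when present) is correct; if /Length is an indirect reference, it falls back to cutting at `endstream` and dropping one EOL before it.

```python
import re
import zlib


def extract_stream(obj: bytes) -> bytes:
    """Return the bytes between 'stream' and 'endstream', excluding EOL markers."""
    start = obj.index(b"stream") + len(b"stream")
    # The keyword is followed by CRLF or LF; neither belongs to the data.
    if obj.startswith(b"\r\n", start):
        start += 2
    elif obj.startswith(b"\n", start):
        start += 1
    # Prefer a direct /Length value (not an indirect reference such as "12 0 R").
    m = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", obj[:start])
    if m:
        return obj[start:start + int(m.group(1))]
    # Fallback: cut at 'endstream' and drop the single EOL in front of it.
    end = obj.rindex(b"endstream")
    data = obj[start:end]
    for eol in (b"\r\n", b"\n", b"\r"):
        if data.endswith(eol):
            return data[:-len(eol)]
    return data


# Demo with a made-up object built around a freshly compressed payload.
payload = b"hello stream"
body = zlib.compress(payload)
obj = (b"56 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(body)
       + body + b"\nendstream\nendobj\n")
assert zlib.decompress(extract_stream(obj)) == payload
```

Relying on /Length where possible avoids guessing whether a trailing newline byte is an EOL marker or part of the compressed data.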
0

I just used

    import de.intarsys.pdf.filter.FlateFilter;

from jPod (on SourceForge), and it works well:

    FlateFilter filter = new FlateFilter(null);
    byte[] decoded = filter.decode(bytes, start, end - start);

The bytes come straight from the PDF file.

fedorqui
0

I wanted to add a more complete answer because I faced the same problem.

I found my answer in the source code of a well established PHP PDF parsing library: FPDI.

https://github.com/flagshipcompany/fpdf/blob/master/fpdi/src/pdf_parser.php#L878

I discovered there are multiple ways to encode a stream: '/FlateDecode', '/LZWDecode', '/ASCII85Decode', '/ASCIIHexDecode'.

For FlateDecode only, the native PHP function gzuncompress is the key. For the others, the FPDI source code contains decoders that you can reuse in your projects.

Tristan CHARBONNIER