I'm trying to decompress a stream from a PDF Object in this file:

 4 0 obj
<< 
/Filter /FlateDecode
/Length 64
>>
stream
xœs
QÐw34V02UIS0´0P030PIQÐpÉÏKIUH-.ITH.-*Ê··×TÉRp
á T‰
Ê
endstream
endobj

I copy-pasted this stream, keeping the same format as in the original file, into a file called Stream.file:

xœs
QÐw34V02UIS0´0P030PIQÐpÉÏKIUH-.ITH.-*Ê··×TÉRp
á T‰
Ê

This stream should decode to: Donde esta curro??. I added Stream.file to a C# console application and tried to decompress it like this:

using System.IO;
using System.IO.Compression;

namespace Filters
{
    public static class FiltersLoader
    {
        public static void Parse()
        {
            var bytes = File.ReadAllBytes("Stream.file");
            var originalFileStream = new MemoryStream(bytes);

            using (var decompressedFileStream = new MemoryStream())
            using (var decompressionStream = new DeflateStream(originalFileStream, CompressionMode.Decompress))
            {
                decompressionStream.CopyTo(decompressedFileStream);
            }    
        }
    }
}

However, it throws an exception while trying to copy it:

The archive entry was compressed using an unsupported compression method.

I'd like to know how to decode this stream with .NET code, if that's possible.

Thanks.

Fritjof Berggren
  • You can't copy-paste binary data as if it were text and expect this to go smoothly. There are many things that can go wrong there, even if by some stroke of luck you use an encoding that contains single code points for all the characters. You'll need to be a little more wily and write a little program to read the file until it gets to the `>>stream` (or just try until you get the offset right) and extract the bytes as true binary content. (That's aside from whether you can actually use `DeflateStream` here; I don't know PDF well enough to say if that's right.) – Jeroen Mostert Oct 01 '19 at 19:52
  • You were partly right: I didn't realize that copying added Windows line endings. But after changing them to Unix LF and making sure both streams look the same in HxD (https://mh-nexus.de/en/hxd/), I still have the same issue. – Fritjof Berggren Oct 01 '19 at 21:07
  • Error message seems pretty self-explanatory to me. You are trying to decompress data that the `DeflateStream()` class does not recognize as a supported compression method. Either the data is corrupted or it uses a different compression method. – Peter Duniho Oct 02 '19 at 02:06
  • Hi, I see that too. However, as you can see from the PDF file I took the stream from, it was compressed once using the FlateDecode algorithm, which is the same one I'm using in C# to decompress it (Deflate). – Fritjof Berggren Oct 02 '19 at 06:26
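
The first comment's suggestion, reading the stream as true binary content rather than pasting it as text, could be sketched roughly as follows. `StreamExtractSketch` and `ExtractFirstStream` are hypothetical names, and the approach is an assumption for illustration: it simply takes everything between the first `stream` keyword and the next `endstream` keyword, ignoring `/Length` and any other PDF structure.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamExtractSketch
{
    // Hypothetical helper: returns the bytes between the first "stream"
    // keyword and the following "endstream" keyword, read as binary.
    public static byte[] ExtractFirstStream(byte[] pdf)
    {
        int start = IndexOf(pdf, Encoding.ASCII.GetBytes("stream"), 0);
        if (start < 0)
            throw new InvalidDataException("No stream keyword found.");
        start += "stream".Length;

        // The keyword is followed by CRLF or LF, which is not part of the data.
        if (start < pdf.Length && pdf[start] == (byte)'\r') start++;
        if (start < pdf.Length && pdf[start] == (byte)'\n') start++;

        int end = IndexOf(pdf, Encoding.ASCII.GetBytes("endstream"), start);
        if (end < 0)
            throw new InvalidDataException("No endstream keyword found.");

        // Note: the result may still include the end-of-line marker that
        // precedes "endstream"; this sketch does not try to strip it.
        var data = new byte[end - start];
        Array.Copy(pdf, start, data, 0, data.Length);
        return data;
    }

    // Naive byte-pattern search; returns -1 when the pattern is absent.
    private static int IndexOf(byte[] haystack, byte[] needle, int from)
    {
        for (int i = from; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

Calling `File.WriteAllBytes("Stream.file", StreamExtractSketch.ExtractFirstStream(File.ReadAllBytes(path)))` with `path` pointing at the PDF would then produce a Stream.file that no text editor or clipboard has touched.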

1 Answer

The main problem is that the DeflateStream class can decode a naked FLATE compressed stream (as per RFC 1951) but the content of PDF streams with FlateDecode filter actually is presented in the ZLIB Compressed Data Format (as per RFC 1950) wrapping FLATE compressed data.

To fix this it suffices to drop the two-byte ZLIB header.

Another problem became clear in your first example document: That document was encrypted, so before FLATE decoding the stream contents therein have to be decrypted.

### Drop ZLIB header to get to the FLATE encoded data

As noted above, the DeflateStream class decodes only a naked FLATE compressed stream (RFC 1951), while the content of a PDF stream with the FlateDecode filter is in the ZLIB Compressed Data Format (RFC 1950): a two-byte header, the FLATE compressed data, and a trailing four-byte Adler-32 checksum.

Fortunately it is pretty easy to jump to the FLATE encoded data therein, one simply has to drop the first two bytes. (Strictly speaking there might be a dictionary identifier between them and the FLATE encoded data but this appears to be seldom used.)

In the case of your code:

var bytes = File.ReadAllBytes("Stream.file");
var originalFileStream = new MemoryStream(bytes);

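// Skip the two-byte ZLIB header (RFC 1950) so that DeflateStream
// starts reading at the raw FLATE data (RFC 1951).
originalFileStream.ReadByte();
originalFileStream.ReadByte();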

using (var decompressedFileStream = new MemoryStream())
using (var decompressionStream = new DeflateStream(originalFileStream, CompressionMode.Decompress))
{
    decompressionStream.CopyTo(decompressedFileStream);
}   
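
A slightly more defensive variant checks that the first two bytes really are a ZLIB header before skipping them. This is only a sketch: `FlateDecodeSketch` and `Inflate` are hypothetical names, and the header checks follow RFC 1950 (compression method 8, header value divisible by 31, FDICT flag in bit 5 of the second byte). It does not verify the trailing Adler-32 checksum.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FlateDecodeSketch
{
    // Hypothetical helper: inflates a ZLIB-wrapped buffer with DeflateStream.
    public static byte[] Inflate(byte[] zlibData)
    {
        if (zlibData == null || zlibData.Length < 2)
            throw new InvalidDataException("Too short to contain a ZLIB header.");

        int cmf = zlibData[0];
        int flg = zlibData[1];

        // RFC 1950: the low nibble of CMF must be 8 (deflate), and the two
        // header bytes read as a big-endian number must be a multiple of 31.
        if ((cmf & 0x0F) != 8 || ((cmf << 8) | flg) % 31 != 0)
            throw new InvalidDataException("Data does not start with a ZLIB header.");

        // Bit 5 of FLG (FDICT) means a preset dictionary is required,
        // which this sketch does not attempt to handle.
        if ((flg & 0x20) != 0)
            throw new NotSupportedException("Preset dictionaries are not handled.");

        // Start reading after the two header bytes.
        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

It could be called as `FlateDecodeSketch.Inflate(File.ReadAllBytes("Stream.file"))`. On .NET 6 or later, `System.IO.Compression.ZLibStream` reads the ZLIB wrapper directly, so there the header would not need to be skipped by hand.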

### In case of encrypted PDFs, decrypt first

Your first example file pdf-test.pdf is encrypted as is indicated by the presence of an Encrypt entry in the trailer:

trailer
<</Size 37/Encrypt 38 0 R>>
startxref
116
%%EOF

Before decompressing stream contents, therefore, you have to decrypt them.
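
As a rough first check before attempting to decompress anything, one could look for the `/Encrypt` key in the file. The following is only a heuristic sketch with hypothetical names (`EncryptCheckSketch`, `MentionsEncrypt`): it scans the whole file for the literal name rather than parsing the trailer, so it is not a substitute for a real PDF parser and will miss the key if it sits inside a compressed cross-reference stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncryptCheckSketch
{
    // Heuristic only: reports whether the literal name "/Encrypt"
    // occurs anywhere in the raw bytes of the PDF.
    public static bool MentionsEncrypt(byte[] pdfBytes)
    {
        // ISO-8859-1 maps every byte to exactly one character,
        // so no bytes are lost or merged during the conversion.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);
        return text.IndexOf("/Encrypt", StringComparison.Ordinal) >= 0;
    }
}
```

For example, `EncryptCheckSketch.MentionsEncrypt(File.ReadAllBytes("pdf-test.pdf"))` would be expected to return true for a file whose trailer looks like the one shown above.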

mkl
  • This is the trailer: trailer <> and it doesn't show the Encrypt object. – Fritjof Berggren Oct 02 '19 at 10:23
  • That may be the trailer after you changed your question to refer to a completely different test file. Please don't make your question such a moving target. – mkl Oct 02 '19 at 10:25
  • I couldn't locate the previous file, which is why I changed it to an easier file, but the problem is the same. – Fritjof Berggren Oct 02 '19 at 10:26
  • Please share the new file. Dumping binary data as text most likely damages it. – mkl Oct 02 '19 at 10:28
  • Sorry for changing the file, this is the latest: https://eternal-todo.com/files/pdf/myPDF_flatedecode.pdf – Fritjof Berggren Oct 02 '19 at 10:29
  • If you open the stream in HxD you'll see it starts with `78 9C`, which is the start of a header of a valid Deflate Stream. – Fritjof Berggren Oct 02 '19 at 10:38
  • `78 01`: No Compression/low; `78 9C`: Default Compression; `78 DA`: Best Compression – Fritjof Berggren Oct 02 '19 at 10:40
  • That PDF is broken (a hint is that after opening it in Adobe Reader and closing it again, Adobe Reader asks whether it should be saved). In particular, your example stream is damaged: it claims a size of 64 bytes but is only 60 or 61 bytes in size. – mkl Oct 02 '19 at 10:51
  • You are right about the size, but in case there is a mismatch you should follow the stream length instead. I'm opening that file in Adobe and it doesn't complain with any errors. – Fritjof Berggren Oct 02 '19 at 10:53
  • Thank you very much for the answer, and the detailed explanation, it really helped. – Fritjof Berggren Oct 02 '19 at 19:01