
I've got a bunch of files that from metadata I can tell are supposed to be PDFs. Some of them are indeed complete PDFs. Some of them appear to be the first part of a PDF file, though they lack the %%EOF and other footer values.

Others appear to be the last part of PDF files (they don't have any of a PDF's headers but they do have the %%EOF stuff). Curiously they start with the following 16-byte magic header:

0x50, 0x4B, 0x57, 0x41, 0x52, 0x45, 0x00, 0x00, 0x00, 0x00, 0x00, 0x57, 0x49, 0x4E, 0x33, 0x32 (PKWARE WIN32).
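For anyone who wants to check their own files for the same pattern, here's a minimal scan for that magic (Python; nothing here is specific to my files beyond the byte sequence itself):

```python
# The 16-byte "PKWARE WIN32" magic observed at the start of the suspect files.
PKWARE_MAGIC = bytes([
    0x50, 0x4B, 0x57, 0x41, 0x52, 0x45, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x57, 0x49, 0x4E, 0x33, 0x32,
])  # i.e. b"PKWARE\x00\x00\x00\x00\x00WIN32"

def find_magic_offsets(data: bytes) -> list[int]:
    """Return every offset at which the PKWARE WIN32 magic appears."""
    offsets = []
    pos = data.find(PKWARE_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(PKWARE_MAGIC, pos + 1)
    return offsets
```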

I'm doing a lot of inference which could possibly be misleading, but it doesn't seem to be a compression scheme (the %%EOF stuff is plaintext) and in the few files I've been allowed to look at deeply there's a correlation between starting with this magic and looking like the final segment of a PDF binary.

Does anyone have any hints as to what file format might be at play here?

Update: I've now observed this PKWARE WIN32 happening on non-PDF files as well. Speculation also suggests that these files are split up in a similar manner.

Update 2: It turns out this PKWARE WIN32 header actually occurs in repeating intervals, the location of which can be predicted by some bytes immediately prior to the header.

I've also received some circumstantial hearsay which suggests that these files are compressed and not split into multiple parts, though in 2 out of the 3 cases where I was told the output file sizes my binaries were only negligibly smaller.

The mystery continues.

Hammer Bro.
  • PKWARE suggests something like a zip file, though that is not a zip header. PKWare do produce some encryption software as well though, possibly it is something to do with that. However, %%EOF is not unique to PDF files by any means! It is also a normal comment in PostScript programs, I don't think you can assume that the content is a PDF file just because of the presence of that. Is there a startxref or xref token ? Or a document information dictionary, metadata, x y obj...endobj sequence ? Something more than just a %%EOF. – KenS Nov 03 '21 at 09:10
  • Yeah, they've also got the `startxref` stuff that more strongly implies that they are PDF-adjacent as the metadata suggests they are. – Hammer Bro. Nov 03 '21 at 09:13
  • startxref certainly sounds like a PDF file. Since you apparently have the beginning of a PDF file in some places, and the ends of PDF files in others, is it possible someone has used a utility to split large files into smaller chunks ? – KenS Nov 03 '21 at 12:06
  • It's a possibility and a theory I'm pursuing but if it's true then at the very least the pieces don't share any locality. There are a lot of files and they're spread out all over the place. I'm also not aware of any PKWARE splitting utilities -- multipart zips don't have those headers, for instance – Hammer Bro. Nov 03 '21 at 13:10
  • Not any meaningful samples. Sensitive data behind restrictive networks and whatnot. I'll try to get myself access to a few more sets of files tomorrow and see if I observe this phenomenon on any non-PDF file types. – Hammer Bro. Nov 03 '21 at 22:17
  • One of them ends with `/Size 629/Info 1 0 R/Root 2 0 R>> startxref 2768002 %%EOF` but the file only contains 224 ' obj' strings, starting at 116 and quickly jumping to 340 and skipping the early 500s as well. Not sure what that means. The file is 2421980 bytes long, which is close to but lower than the listed number. I'll assume that's further evidence that there's a top part somewhere out there. – Hammer Bro. Nov 04 '21 at 08:44
  • The `/Size`, `/Info`, `/Root` etc. is part of the PDF data itself. – CherryDT Nov 10 '21 at 23:44

1 Answer


Okay, so this ended up being a very strange format. Overall it's a compression scheme, but it's applied inconsistently and lightly wrapped in a way that confused the issue.

Each of these files starts with its own 8-byte magic, and the next 8 bytes can be read as a long telling us the final size of the output file.

Then there's a 16-byte "section" header (four ints): the first is just an incremental counter, the second is the number of bytes until the next "section" break, the third is a bit of a mystery to me, and the fourth is either 0 or 1. If that last int is 0, just read the next (however many) bytes as-is. They're payload.

If it's 1 then you'll get one of these PKWARE headers next. These are the part I understand least well: beyond starting with the magic from the original question, all I know is that they're 42 bytes long in total.

If you had a PKWARE header, subtract 42 from the number of bytes to read, then treat the remaining bytes as compressed with PKWARE's "implode" algorithm. That means you can decompress them with an "explode" implementation (zlib ships one, `blast`, in its contrib directory).

Iterate through the file taking all these headers into account and cobbling together compressed and uncompressed parts 'til you run out of bytes and you'll end up with your output file.
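The walk described above can be sketched roughly like this (Python). Field widths and little-endian byte order are my assumptions, and `explode` is a stand-in for a PKWARE DCL decompressor such as a binding to zlib's contrib `blast` code, which isn't in the standard library:

```python
import struct

PKWARE_HEADER_LEN = 42  # observed length of the in-stream PKWARE header

def unwrap(data: bytes, explode=None) -> bytes:
    """Reassemble the output file from the block-wrapped container.

    `explode` is a callable that decompresses PKWARE "implode" data
    (e.g. a binding to zlib's contrib/blast); it is only needed if the
    file actually contains compressed sections.
    """
    # 8-byte file magic, then an 8-byte long holding the final output size.
    out_size = struct.unpack_from("<q", data, 8)[0]
    out = bytearray()
    pos = 16
    while pos < len(data) and len(out) < out_size:
        # 16-byte section header: counter, bytes until the next section,
        # an unknown int, and a compressed flag (0 or 1).
        _counter, length, _unknown, compressed = struct.unpack_from("<4i", data, pos)
        pos += 16
        chunk = data[pos:pos + length]
        pos += length
        if compressed:
            # Skip the 42-byte PKWARE header, then explode the rest.
            out += explode(chunk[PKWARE_HEADER_LEN:])
        else:
            out += chunk  # plain payload, copied as-is
    return bytes(out)
```

This is only a reconstruction from the limited samples I had; in particular the meaning of the third int and the contents of the 42-byte header are still unknown.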

I have no idea why only parts of files are compressed nor why they've been broken into blocks like this but it seems to work for the limited sample data I have. Perhaps later on I'll find files that actually have been split up along those boundaries or employ some kind of fancy deduplication but at least now I can explain why it looked like I saw partial PDFs -- the files were only partially compressed.

Hammer Bro.