I need to read a binary file in Java and split it up (it's actually a binary file containing many PDF files, with a single line of "metadata" before each).

Each PDF item in the binary file ends with a "%%EOF" marker.

In my first attempt, I read the file line by line as UTF-8 text, but this corrupted the binary data.

reader = new BufferedReader(new InputStreamReader(new FileInputStream(binaryFile), "UTF-8"));

String mdmeta;
while ((mdmeta = reader.readLine()) != null) {
    System.out.println("read file metadata: " + mdmeta);
    writeToFile("exploded-file-123");
}

And the writeToFile method:

BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fullFilename), "UTF-8"));

writer.write("%PDF-1.4\r\n");
String line;
while ((line = reader.readLine()) != null) {
    writer.write(line);
    writer.write("\r\n");
    if ("%%EOF".equals(line)) {
        writer.flush();
        return;
    }
}

Although this splits the file into exploded items, the resulting binary files are corrupt (almost certainly because I read and wrote the bytes as UTF-8 strings).

I think I need a lower-level approach, using InputStreams.

It gets complicated since the files can be large. Imagine I use a buffer... I can read bytes from the file to fill the buffer... then I need to look for the "%%EOF" inside the buffer... and manually split the buffer between the previous exploded item and the next one.

Or if "%%EOF" falls on the buffer edge then I might miss the file boundary completely...

I guess I'm looking for some sort of way to readBytesUpUntil("%%EOF") - is there an easy way to do this?

Achal
vikingsteve
  • do you know the encoding this binary file is in? – Eugene Sep 04 '18 at 14:15
  • not exactly sure I understood this properly, but if you know the original encoding, why not convert `%%EOF` into that encoding and search just for that, keeping the original encoding all the time – Eugene Sep 04 '18 at 14:17
  • PDF files are binary files. You treat them as text. This is very likely to damage the PDFs beyond repair. Instead copy content based on the byte sequences you retrieve from an `InputStream` and store them in `OutputStream` instances. – mkl Sep 04 '18 at 14:50
  • @Eugene I had presumed otherwise, but PDF files apparently have "no" encoding. They have byte streams that can really be encoded as anything – vikingsteve Sep 04 '18 at 20:55
  • @mkl do you mean to build a byte sequence in memory, until I find the %%EOF, and then write it to file? – vikingsteve Sep 04 '18 at 20:56
  • That's one option. Alternatively you can read the content a block at a time, search it for the end sequence, and write that block early. Be sure, though, to consider the case of the end sequence being split across two blocks. – mkl Sep 05 '18 at 04:36
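
A minimal sketch of the block-at-a-time idea from the comments, assuming the marker is the ASCII byte sequence "%%EOF". The names `MarkerCopy` and `copyUntil` are hypothetical, not part of any library. The match state is carried from byte to byte, so a marker that straddles a buffer edge is not missed; wrapping the source in a `BufferedInputStream` keeps the per-byte reads cheap. Note that this only finds the next marker and does not decide whether that marker is really the end of a document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class MarkerCopy {
    /**
     * Copies bytes from in to out, up to and including the first occurrence
     * of marker. Returns true if the marker was found, false if the stream
     * ended first. Bytes are never decoded as text, so binary data survives.
     */
    static boolean copyUntil(InputStream in, OutputStream out, byte[] marker) throws IOException {
        if (marker.length == 0) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        int[] fail = failureTable(marker);
        int matched = 0; // number of marker bytes matched so far
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
            // On a mismatch, fall back to the longest prefix that still matches
            // (needed because "%%EOF" starts with a repeated byte).
            while (matched > 0 && marker[matched] != (byte) b) {
                matched = fail[matched - 1];
            }
            if (marker[matched] == (byte) b) {
                matched++;
            }
            if (matched == marker.length) {
                return true;
            }
        }
        return false;
    }

    /** Knuth-Morris-Pratt failure table for the marker. */
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) {
                k = fail[k - 1];
            }
            if (p[i] == p[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    public static void main(String[] args) throws IOException {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        // Toy input standing in for the real file (made up for illustration).
        byte[] data = "%PDF-1.4 first %%EOF%PDF-1.4 second %%EOF"
                .getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(data);
        ByteArrayOutputStream item = new ByteArrayOutputStream();
        while (copyUntil(in, item, marker)) {
            System.out.println("item of " + item.size() + " bytes");
            item.reset();
        }
    }
}
```

In real use the `ByteArrayOutputStream` would likely be replaced by a `FileOutputStream` per item, so large items are never held in memory.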

1 Answer

PDF viewers start reading a file at the end. They look for the %%EOF, and then for the start of the xref table aka the cross-reference table. The cross reference table maps all objects to their byte offset.

For instance:

  • the object with number 1 starts at byte position 12578
  • the object with number 2 starts at byte position 158
  • the object with number 3 starts at byte position 9821
  • the object with number 4 starts at byte position 18792
  • ...

And so on.
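
To make the mapping concrete, here is a rough sketch that parses a classic (uncompressed) cross-reference table into object number and byte offset pairs. The class `XrefSketch`, the method `parseXref`, and the sample table are hypothetical and only mirror the offsets listed above; cross-reference streams, which newer PDFs may use instead, are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class XrefSketch {
    /**
     * Parses a classic cross-reference table ("xref" keyword, subsection
     * headers of the form "first count", then one entry per object) and
     * returns object number -> byte offset for all in-use ("n") entries.
     */
    static Map<Integer, Long> parseXref(String table) {
        String[] lines = table.split("\\r\\n|\\r|\\n");
        if (lines.length == 0 || !lines[0].trim().equals("xref")) {
            throw new IllegalArgumentException("not a classic xref table");
        }
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        int i = 1;
        // Each subsection starts with "<first object number> <entry count>".
        while (i < lines.length && lines[i].trim().matches("\\d+\\s+\\d+")) {
            String[] header = lines[i].trim().split("\\s+");
            int first = Integer.parseInt(header[0]);
            int count = Integer.parseInt(header[1]);
            i++;
            for (int n = 0; n < count && i < lines.length; n++, i++) {
                // Entry layout: 10-digit offset, 5-digit generation, "n" or "f".
                String[] entry = lines[i].trim().split("\\s+");
                if (entry.length == 3 && entry[2].equals("n")) {
                    offsets.put(first + n, Long.parseLong(entry[0]));
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up table matching the example offsets in the text.
        String sample = "xref\n"
                + "0 5\n"
                + "0000000000 65535 f \n"
                + "0000012578 00000 n \n"
                + "0000000158 00000 n \n"
                + "0000009821 00000 n \n"
                + "0000018792 00000 n \n"
                + "trailer\n";
        System.out.println(parseXref(sample)); // prints {1=12578, 2=158, 3=9821, 4=18792}
    }
}
```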

A PDF viewer also looks for the object number of the /Catalog, aka the root dictionary of the PDF document. It finds the /Catalog object by going to the byte offset defined for it in the cross-reference table.

From that root dictionary, a PDF viewer obtains the root of the /Pages tree. From the /Pages tree, it gets information about the pages in the PDF, including where to find all the content and resources needed to render a page.

All of this happens through random-access of the file at byte offsets retrieved from the cross-reference table based on object numbers.
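
As an illustration of reading from the end, the sketch below seeks to the tail of a file and extracts the number that follows the `startxref` keyword, which is the byte offset of the last cross-reference section. The names `StartXref`, `readTail`, and `findStartXref` are hypothetical, and the tail size passed by the caller (for example 1024 bytes) is an arbitrary choice for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class StartXref {
    /** Reads at most maxBytes from the end of the file using random access. */
    static byte[] readTail(String path, int maxBytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            int n = (int) Math.min(maxBytes, length);
            byte[] tail = new byte[n];
            raf.seek(length - n);
            raf.readFully(tail);
            return tail;
        }
    }

    /**
     * Returns the offset written after the last "startxref" keyword in the
     * given bytes, or -1 if the keyword or the number is missing.
     */
    static long findStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf("startxref");
        if (idx < 0) {
            return -1;
        }
        int p = idx + "startxref".length();
        while (p < text.length() && Character.isWhitespace(text.charAt(p))) {
            p++;
        }
        int start = p;
        while (p < text.length() && text.charAt(p) >= '0' && text.charAt(p) <= '9') {
            p++;
        }
        return p > start ? Long.parseLong(text.substring(start, p)) : -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Usage: java StartXref some.pdf
            System.out.println(findStartXref(readTail(args[0], 1024)));
        }
    }
}
```

A viewer would then jump to that offset, read the cross-reference data there, and follow the object offsets from it.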

Now:

  • Imagine that you insert some arbitrary bytes into a PDF file,
  • Imagine that you don't adapt the cross-reference table,
  • How do you expect a PDF viewer can find the objects it needs to render the document?

Additionally, a PDF can contain more than one %%EOF marker. This is the case with Linearized PDF, and this is the case with PDFs that have been incrementally updated.

Such PDF files also have to be read starting from the end. In the cross-reference table of the last revision, some existing objects will be replaced and new objects will be added, but you'll still need the cross-reference tables of the previous revisions; otherwise, you can't render anything.

Now:

  • Imagine that you would split a file that is incrementally updated based on the occurrence of %%EOF,
  • Imagine that you would save each of those snippets as a separate file,
  • Then only the first file would be a valid PDF file; all the subsequent files would be missing resources such as fonts, reused images, etc. The subsequent files would not be full PDF documents.
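
The effect is easy to see on a toy byte sequence. The sketch below counts "%%EOF" occurrences in a made-up stand-in for a single, incrementally updated document; `EofCount` and `countOccurrences` are hypothetical names, and the sample bytes are not a real PDF.

```java
import java.nio.charset.StandardCharsets;

class EofCount {
    /** Counts occurrences of a non-empty marker in data, comparing raw bytes. */
    static int countOccurrences(byte[] data, byte[] marker) {
        int count = 0;
        for (int i = 0; i + marker.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // One document: an original revision followed by one appended update.
        String oneDocument = "%PDF-1.4\n(original objects)\nstartxref\n120\n%%EOF\n"
                + "(updated objects)\nstartxref\n480\n%%EOF\n";
        byte[] data = oneDocument.getBytes(StandardCharsets.ISO_8859_1);
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countOccurrences(data, marker)); // prints 2
    }
}
```

A splitter keyed on the marker would cut this one document into two pieces, and the second piece would not even start with a PDF header.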

In short:

Splitting a long PDF document based on the occurrence of %%EOF is not wise. Even if a series of valid PDF files is glued together, you risk breaking those files, because a single PDF file can contain more than one occurrence of %%EOF.

Bruno Lowagie
  • Hi Bruno, wow - the famous author of pdf tools? Thank you for stopping by a mere mortal like myself to help out. I'll definitely reread what you have written tomorrow at work - THANK YOU! – vikingsteve Sep 04 '18 at 20:54
  • I'm looking at some old code that looks for the "start" of a PDF file. Does each PDF file have a start marker like this? `if ((pdfBuf[i] == 37) && (pdfBuf[i + 1] == 80) && (pdfBuf[i + 2] == 68) && (pdfBuf[i + 3] == 70))` – vikingsteve Sep 05 '18 at 08:39