
I am trying to figure out how to correctly read the Byte_offset_of_last_cross-reference_section value from the trailer of a PDF file.

According to the PDF 1.7 (ISO 32000-1:2008) specification, the file is structured so that it should be read starting from the end. Here is the general form of a trailer, followed by a simplified (minimal) example, as they appear when I use a StreamReader and read the file line by line (UTF-8 encoding):

trailer
<< key1 value1
     key2 value2
     …
     keyn valuen
>>
startxref
Byte_offset_of_last_cross-reference_section
%%EOF

trailer
<</Root 7 0 R /Size 7>>
startxref
696
%%EOF

The value I want to grab is 696. I'm just not sure how to do that with a BinaryReader, starting from the end of the file.

myermian

2 Answers


You can use the Seek method of the stream, passing SeekOrigin.End as the origin so that the offset is measured from the end of the file (the other options are SeekOrigin.Begin and SeekOrigin.Current).

Example:

using (var reader = File.Open(...))
{
    // An offset relative to SeekOrigin.End must be negative to land inside the file.
    reader.Seek(-100, SeekOrigin.End);
    //...
}

From there you can either read backwards in a loop until you reach the startxref marker (or anything else that tells you where the 696 begins), or assume the trailer ends within the last 100 bytes of the file, read those bytes into a small array, and search that array, as Anthony suggested in the comment below.
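The second approach can be sketched roughly as follows. This is only a minimal illustration, not a full PDF parser: the class name PdfTrailer, the method ReadStartXref, and the default window of 1024 bytes are assumptions made for this example, and it simply takes the number after the last startxref keyword found in that window.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class PdfTrailer
{
    // Hypothetical helper: returns the number that follows the last "startxref"
    // keyword found in the final tailLength bytes of the file.
    public static long ReadStartXref(string path, int tailLength = 1024)
    {
        using (var stream = File.OpenRead(path))
        {
            // Never seek back further than the file is long.
            int count = (int)Math.Min(tailLength, stream.Length);
            stream.Seek(-count, SeekOrigin.End);

            // Stream.Read may return fewer bytes than requested, so loop.
            var buffer = new byte[count];
            int read = 0;
            while (read < count)
            {
                int n = stream.Read(buffer, read, count - read);
                if (n == 0) break;
                read += n;
            }

            // ASCII decoding yields one char per byte; non-ASCII bytes become '?',
            // which is harmless because only the keyword and the digits matter.
            string tail = Encoding.ASCII.GetString(buffer, 0, read);

            int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
            if (keyword < 0)
                throw new InvalidDataException("startxref not found in the last " + count + " bytes.");

            // Skip the keyword and the whitespace after it, then collect the digits.
            int start = keyword + "startxref".Length;
            while (start < tail.Length && char.IsWhiteSpace(tail[start])) start++;
            int end = start;
            while (end < tail.Length && tail[end] >= '0' && tail[end] <= '9') end++;
            if (end == start)
                throw new InvalidDataException("No byte offset follows startxref.");

            return long.Parse(tail.Substring(start, end - start), CultureInfo.InvariantCulture);
        }
    }
}
```

For the minimal trailer shown in the question, PdfTrailer.ReadStartXref(path) would return 696. Searching for the last occurrence of the keyword, rather than relying on fixed line positions, also tolerates a few stray bytes after the %%EOF marker.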

Sebastian Piu
  • Link-only answers are discouraged. Please reflect the core of what you are trying to show in your answer by providing a code snippet or a more elaborate explanation. – Jeroen Vannevel Nov 21 '13 at 20:09
  • It might be better to start from `reader.Length - 50` and continue to seek forward until you find what you need. Not sure how good it would be to actually seek 1 byte at a time backwards in a file. – Anthony Nov 21 '13 at 20:48
  • Actually, if you want to emulate the laxity of e.g. Adobe Reader, you would start from `reader.Length - 1000` and tolerate some trash bytes after the EOF marker. Cf. the implementation notes. – mkl Nov 21 '13 at 21:50
  • @mkl: Where did you see that Adobe Reader starts 1000 bytes back? Also, why 1000 instead of 1024, considering that's the normal buffer size. – myermian Nov 21 '13 at 22:03
  • You are right, it's not exactly 1000. But I talked about *emulating the laxity of e.g. Adobe Reader* and by that didn't mean operating exactly like that one product but allowing a certain fairly common degree of laxity. – mkl Nov 21 '13 at 22:34

How about using something like:

// File.ReadAllLines returns string[], so use Length rather than Count.
string[] allLines = File.ReadAllLines(filePathHere);
return allLines[allLines.Length - 2];
Sourav 'Abhi' Mitra
  • Per the specifications, it isn't recommended to read the file line by line forward. It is recommended (as the question states) to read the file from end to start. – myermian Nov 21 '13 at 20:21
  • PDFs can be pretty big. Reading all lines like this merely to retrieve one number is a huge waste of resources. – mkl Nov 21 '13 at 20:46