I'm developing a PDF parser/writer, but I'm stuck at generating cross reference streams. My program reads this file and then removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it.

This works really well when I use the normal cross reference & trailer, as you can see in this file.

When I try to generate a cross reference stream object instead (which results in this file), Adobe Reader can't view it.

Does anyone have experience with PDFs and can help me find out what the problem is?

Note that the cross reference is the ONLY difference between file 2 and file 3. The first 34127 bytes are the same.

If someone needs the content of the decoded reference stream, download this file and open it in a hex editor. I've checked this reference table again and again but I could not find anything wrong, and the dictionary seems to be OK, too.

Thanks so much for your help!!!

Update

I've now completely solved the problem. You can find the new PDF here.

Van Coding

2 Answers

Two problems I see (without looking at the stream data itself):

  1. "Size integer (Required) The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary."

    Your Size should be... 14 (the xref stream itself is object 13).

  2. "Index array (Optional) An array containing a pair of integers for each subsection in this section. The first integer shall be the first object number in the subsection; the second integer shall be the number of entries in the subsection. The array shall be sorted in ascending order by object number. Subsections cannot overlap; an object number may have at most one entry in a section. Default value: [0 Size]."

    Your index should probably skip around a bit. You have no objects 2-4 or 7. The index array needs to reflect that.

  3. Your data Ain't Right either (and I just learned how to read an xref stream. Yay me.)

00 00 00  
01 00 0a  
01 00 47  
01 01 01  
01 01 70  
01 02 fd  
01 76 f1  
01 84 6b  
01 84 a1  
01 85 4f

According to this data, which because of your missing /Index is interpreted as object numbers 0 through 9, the objects have the following offsets:

0 is unused.  Fine.  
1 is at 0x0a.  Yep, sure is  
2 is at 0x47.  Nope.  That lands near the beginning of "1 0"'s stream. This probably isn't a coincidence.  
3 is at 0x101.  Nope.  0x101 is still within "1 0"'s stream.  
4 is at 0x170.  Ditto  
5 is at 0x2fd.  Ditto  
6 is at 0x76f1. Nope, and this time buried inside that image's stream.

I think you get the idea. So even if you had a correct /Index, your offsets are all wrong (and completely different from what's in resultNormal.pdf, even allowing for dec-hex confusion).
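
To make that reading reproducible, here is a minimal Python sketch of how a reader maps the rows to object numbers. The helper name `parse_xref_rows` is made up for illustration, and it assumes the bytes are already fully decoded and that /W is [1 2 0], as in the dictionary of 13 0 obj. With no /Index, the default [0 Size] applies, so the ten rows land on objects 0 through 9.

```python
def parse_xref_rows(data, widths, index=None, size=None):
    """Return (object number, type, field 2, field 3) for each row.

    Hypothetical helper: `data` must already be inflated and un-predicted.
    """
    row_len = sum(widths)
    if row_len == 0 or len(data) % row_len:
        raise ValueError("data length is not a multiple of the row width")
    n_rows = len(data) // row_len

    # Without /Index the default is [0 Size]; fall back to the row count
    # when the caller does not pass a Size either.
    if index is None:
        index = [0, size if size is not None else n_rows]

    # Expand the (first, count) pairs into a flat list of object numbers.
    obj_nums = []
    for first, count in zip(index[0::2], index[1::2]):
        obj_nums.extend(range(first, first + count))

    entries = []
    for row, obj_num in zip(range(n_rows), obj_nums):
        pos = row * row_len
        fields = []
        for i, width in enumerate(widths):
            if width == 0:
                # A zero-width field is absent: the type defaults to 1,
                # the other fields default to 0.
                fields.append(1 if i == 0 else 0)
            else:
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        entries.append((obj_num, *fields))
    return entries


# The ten rows quoted above, read with /W [1 2 0] and no /Index.
rows = bytes.fromhex(
    "000000 01000a 010047 010101 010170 0102fd 0176f1 01846b 0184a1 01854f"
)
for obj_num, kind, offset, gen in parse_xref_rows(rows, [1, 2, 0]):
    print(obj_num, kind, hex(offset), gen)
```

Passing `index=[0, 2, 5, 2, 8, 5]` to the same helper would assign the first nine rows to objects 0, 1, 5, 6, 8, 9, 10, 11 and 12 instead.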

What you want can be found in resultNormal's xref:

xref  
0 2  
0000000000 65535 f  
0000000010 00000 n  
5 2  
0000003460 00000 n  
0000003514 00000 n  
8 5  
0000003688 00000 n  
0000003749 00000 n  
0000003935 00000 n  
0000004046 00000 n  
0000004443 00000 n  

So your index should be (if I'm reading this right): /Index [0 2 5 2 8 5]. And the data:

0 0 0  
1 0 a  
1 3460 (that's decimal)  
1 3514 (ditto)  
1 3688  
etc
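
As a rough sketch of how a writer could derive both pieces from its own object table, the Python below builds the /Index array and the row bytes from a dictionary of object number to offset. The names `build_index` and `encode_rows` are hypothetical, the offsets are the ones copied from resultNormal's xref above, and /W [1 2 0] is assumed. The xref stream object itself (13 0) would also need an entry at its own offset; it is left out here because that offset is not in the table shown.

```python
def build_index(obj_nums):
    """Collapse object numbers into /Index pairs: [first count first count ...]."""
    index = []
    for n in sorted(obj_nums):
        if index and index[-2] + index[-1] == n:
            index[-1] += 1      # n continues the current subsection
        else:
            index += [n, 1]     # gap: start a new subsection
    return index


def encode_rows(offsets, widths=(1, 2, 0)):
    """Serialize one row per object, in ascending object-number order."""
    rows = bytearray()
    for n in sorted(offsets):
        kind = 0 if n == 0 else 1   # object 0 is the free-list head
        # The third field has width 0 here, so the generation is not stored.
        for value, width in zip((kind, offsets[n], 0), widths):
            rows += value.to_bytes(width, "big")
    return bytes(rows)


# Offsets taken from resultNormal's xref table (decimal).
offsets = {0: 0, 1: 10, 5: 3460, 6: 3514, 8: 3688,
           9: 3749, 10: 3935, 11: 4046, 12: 4443}

print(build_index(offsets))         # [0, 2, 5, 2, 8, 5]
print(encode_rows(offsets).hex(" ", 3))
```

A two-byte offset field tops out at 65535, so a larger file would need a wider second entry in /W.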

Interestingly, the PDF spec says that the size must be BOTH the number of entries in this and all previous XRefs AND the number one higher than the highest object number in use.

I don't think the latter part is ever enforced, but I wouldn't be surprised to find that xref streams are more retentive than the normal cross reference tables. Might be the same code handling both, might not.


@mtraut:

Here's what I see:

13 0 obj <</Size 10/Length 44/Filter /FlateDecode/DecodeParms <</Columns 3/Predictor 12>>/W [1 2 0]/Type /XRef/Root 8 0 R>>
stream  
...  
endstream  
endobj  
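
For anyone who wants to inspect a stream with these DecodeParms by hand: after inflating, Predictor 12 means every row of Columns bytes is preceded by one PNG filter tag byte, and tag 2 ("Up") stores each byte as the difference to the byte above it. The Python below is only a sketch under those assumptions; the function names are invented, and only tags 0 (None) and 2 (Up) are handled.

```python
import zlib


def png_up_encode(data, columns):
    """Prefix each row with tag 2 and store differences to the previous row."""
    out = bytearray()
    prev = bytes(columns)               # the row above the first row is all zeros
    for i in range(0, len(data), columns):
        row = data[i:i + columns]
        out.append(2)
        out += bytes((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return bytes(out)


def png_decode(data, columns):
    """Undo the per-row PNG filter (tags 0 and 2 only)."""
    stride = columns + 1                # one tag byte plus the row itself
    if len(data) % stride:
        raise ValueError("data length is not a multiple of Columns + 1")
    out = bytearray()
    prev = bytes(columns)
    for i in range(0, len(data), stride):
        tag, raw = data[i], data[i + 1:i + stride]
        if tag == 0:
            row = bytes(raw)
        elif tag == 2:
            row = bytes((b + p) & 0xFF for b, p in zip(raw, prev))
        else:
            raise NotImplementedError(f"PNG filter type {tag}")
        out += row
        prev = row
    return bytes(out)


# Round trip three rows of /W [1 2 0] data through predictor and Flate.
plain = bytes.fromhex("000000 01000a 010d84")
stream_bytes = zlib.compress(png_up_encode(plain, 3))
assert png_decode(zlib.decompress(stream_bytes), 3) == plain
```

Reading the inflated bytes without undoing the predictor gives row values that look like offsets but are really differences, so it is worth checking which form a hex dump shows.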
Mark Storer
  • Thanks so much!!! I think it is the thing with the /Index tag. Because it is optional I ignored it. But because my objects aren't from 0 to 9, it is required. – Van Coding Dec 30 '10 at 08:37
  • Ok, now I saved the ref stream without encryption and added the /Index Array to the Dictionary. Now it works great! Thanks so much!! – Van Coding Dec 30 '10 at 08:49
  • I know... I know: I *rock*. And I'm super modest too! And good looking. Let's not forget "good looking". Err... I mean... "you're welcome". – Mark Storer Dec 30 '10 at 17:06

The "resultstream.pdf" does not have a valid cross ref stream.

If I open it in my viewer, it tries to read object "13 0" as a cross ref stream, but it's a plain dictionary (the stream tags and data are missing).

A little off topic: what language are you developing in? At least in Java I know of three valuable choices (PDFBox, iText and jPod; as one of its developers I personally opt for jPod, a very clean implementation :-). If this does not fit your platform, maybe you can at least have a look at the algorithms and data structures.

EDIT

Well - if "resultstream.pdf" is the document in question then this is what my editor (SCITE) sees

...
13 0 obj
<</Size 0/W [1 2 0]/Type /XRef/Root 8 0 R>>
endobj
startxref
34127
%%EOF

There is no stream.

mtraut
  • I don't think that's the problem. I see "stream .... endstream" in notepad. Perhaps your viewer has an issue? – Mark Storer Dec 30 '10 at 00:27
  • SciTE is a text editor (right?), so that's not the issue. I'm gonna guess you opened the PDF in some viewer app and then saved from there, rather than saving from the link directly. ALWAYS save from the link, ESPECIALLY when dealing with a known-corrupt PDF. Viewers tend to try to fix the file, so the version they save may be quite different from what they loaded. – Mark Storer Dec 30 '10 at 20:53
  • @Mark, I apologize for being pedantic... I did open the PDF with OUR viewer and got the message that 13 0 is not a stream. After YOUR comment I copied a fragment of the PDF to show what I'm seeing. Clicking and saving does not work anyway because Adobe can't open it (and as such there is no save). Today when I download "resultstream.pdf" there is a version WITH a stream attached. I have overwritten the previous version, so I can no longer compare. Maybe the experiments of @FlashFan have changed the document in the meantime. BTW, thank you for your careful analysis of the topic in your answer. – mtraut Dec 31 '10 at 09:05
  • No worries, reminds me of a funny story. Some not-as-smart-as-she-thought-she-was friend of ours in college was trying to build up a good rant, and I kept correcting her on the details. She gets upset and says: "You're being pedastic!". To which I replied "PedANTic". That completely sucked the wind out of her sails. I even convinced her to look it up and she found she was wrong. Again. Good times... good times... – Mark Storer Dec 31 '10 at 16:50
  • PS: which viewer is "OUR viewer"? Something publicly available/trialable? SCITE == SciTE? – Mark Storer Dec 31 '10 at 16:58
  • @mark - hmmm, i'm not completely sure which part is mine - maybe dont wanna know... – mtraut Dec 31 '10 at 17:26
  • Our viewer is "CABAReT Stage", free download at "www.cabaret-solutions.com". Based on jPod and jPodRenderer. SCITE is just SciTE... – mtraut Dec 31 '10 at 17:27