How to repair pdf file without xref in php?

Question

I have a PDF file without an xref table. This PDF was generated by a third-party service.

Is there any library or solution to fix this PDF file without opening it in Adobe Acrobat? The error is "Unable to find xref table".

*I have a PDF file without an xref table. This PDF was generated by a third-party service.* - Have you checked whether the PDF simply has a cross reference stream instead of the cross reference table? If it has, there is no need for a cross reference table. Furthermore, if object streams are used in the PDF, the cross references have to be put into a stream instead of a table... — mkl, Jul 27 '16 at 07:18
mkl brings up an interesting point. If the PDF version is >= 1.5, the startxref could point to a cross reference stream (see section 3.4.7). If that's what is happening, it could imply that your PDF API isn't able to properly process a cross reference stream, in which case the better course of action would be to upgrade your PDF API. — Patrick Gallot, Jul 28 '16 at 16:24
The file version is PDF-1.4.%. I don't have much experience with PDF structure; I showed this file to some people and they say it's corrupted. — kusanagi, Jul 28 '16 at 16:31
in that case, @mkl's comment from [link](http://stackoverflow.com/questions/16928698/with-php-how-can-i-check-if-a-pdf-file-has-errors?rq=1) points to the likeliest source of the corruption. — Patrick Gallot, Jul 28 '16 at 18:26
@kusanagi Essentially we are poking into the dark now. Is it possible for you to share the file? — mkl, Jul 28 '16 at 19:27
http://diplom.spbrealty.pp.ua/test.pdf here is pdf file, thanks — kusanagi, Jul 29 '16 at 05:52

score 5 · Answer 1 · answered Jul 26 '16 at 15:55

Creating an xref table for a pdf that never had any shouldn't be too hard (unlikely to involve linearization or incremental saves), so you have to wonder at the quality of the PDF that was generated by that PDF Producer.

Get a copy of the PDF (v1.7) Reference; the sections you'll need to reference are 3.2.9, and 3.4 (3.4.3 and 3.4.4 in particular), and open up your file in a hex editor.

Scroll to the very bottom of the file. The file should end with "%%EOF"; immediately preceding that should be 'startxref'[\r\n] followed by a number which is the byte offset for the start of the 'xref' section. Based on your error message, this number is likely missing or off. The xref section is usually after the last endobj but above the trailer section which itself is above the startxref section. You will want to keep a copy of the trailer to tack back on after you have written out the 'xref' section.
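The check described above can also be done without a hex editor. The following is a minimal sketch, assuming the whole file fits in memory and that the last `startxref` sits within the final 1024 bytes; the function names and the tiny hand-made sample are hypothetical, and Python is used only for illustration (the same byte searches can be written with PHP's string functions).

```python
import re


def read_startxref(data: bytes):
    """Return the number stored after the last 'startxref', or None if absent."""
    tail = data[-1024:]  # assumption: the keyword is near the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return int(matches[-1]) if matches else None


def points_at_xref(data: bytes, offset: int) -> bool:
    """True if the bytes at `offset` start a classic cross-reference table."""
    return data[offset:offset + 4] == b"xref"


# Hand-made stand-in for a PDF, just large enough to exercise the functions.
body = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
xref_pos = len(body)
sample = (
    body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_pos).encode() + b"\n%%EOF\n"
)

offset = read_startxref(sample)
print(offset, points_at_xref(sample, offset))
```

If `points_at_xref` returns False for a real file, the stored number is off, which matches the error message in the question.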

To create the xref section, scan the body of the PDF for lines of the form: IDNumber GenNumber 'obj'\r\n. In the simplest case, GenNumber is always 0 and IDNumber increases sequentially as you move from top to bottom. (If GenNumber is ever nonzero, you are dealing with a file that has been incrementally saved; that's a complication you don't want to deal with.) Keep track of the offset of each of those lines along with its IDNumber and GenNumber.

After the 'xref' keyword, write out a first line consisting of the first IDNumber and the number of indirect objects found (assuming that they are all in sequential order). Then, for each indirect object, write out the offset (padded to 10 digits), a space, the GenNumber (padded to 5 digits, e.g. 00000), a space, 'n', and an end of line (\r\n). Afterwards, tack on the trailer that was saved earlier, then the startxref section with its number updated to the byte offset of the new 'xref' keyword, and the '%%EOF' line. Save your file and see whether that fixes the problem.
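The scan-and-write procedure above could look roughly like the sketch below. It is an illustration under strong assumptions, not a general repair tool: object numbers form one unbroken run starting at 1, no object streams are used, and no binary stream happens to contain a line that looks like an object header. The name `build_xref` is hypothetical. One detail is added beyond the description: the table starts at object 0 with the free entry `0000000000 65535 f`, which the PDF Reference requires as the first entry.

```python
import re

# An object header: start of line, object number, generation number, 'obj'.
OBJ_RE = re.compile(rb"(?:^|[\r\n])(\d+) (\d+) obj\b")


def build_xref(data: bytes) -> bytes:
    """Build a classic xref section from the 'N G obj' headers found in data."""
    offsets = {}
    for m in OBJ_RE.finditer(data):
        # m.start(1) is the byte offset of the object number itself.
        offsets[int(m.group(1))] = (m.start(1), int(m.group(2)))
    if not offsets:
        raise ValueError("no indirect objects found")

    count = max(offsets) + 1  # entries 0..max, entry 0 being the free entry
    lines = [b"xref\r\n", b"0 %d\r\n" % count, b"0000000000 65535 f\r\n"]
    for num in range(1, count):
        if num not in offsets:
            raise ValueError("gap in object numbers at %d" % num)
        offset, gen = offsets[num]
        # Each entry is exactly 20 bytes: 10 digits, space, 5 digits, space, 'n', CRLF.
        lines.append(b"%010d %05d n\r\n" % (offset, gen))
    return b"".join(lines)


body = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
print(build_xref(body).decode("ascii"))
```

The saved trailer, a `startxref` line holding `len(body)` (the offset where the new table begins), and `%%EOF` would then be appended after this section.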

Patrick Gallot


Thanks for the answer. I opened the PDF file in vim's hex mode; at the end I found `startxref.111945.%%EOF`, but I can't find IDNumber and GenNumber. Hex editing and PDF structure are also very new to me. – kusanagi Jul 28 '16 at 16:50
If it's not too much trouble, could you take a look at the problem PDF? – kusanagi Jul 28 '16 at 16:53
When you actually look in the file you will see byte sequences like: 13 0 obj, 19 0 obj, or 27 0 obj and so on. The first number is the ID of the indirect object; the second number is the generation, which is only relevant if the file has been incrementally saved. – Patrick Gallot Jul 28 '16 at 18:04
111945 in hexadecimal is 0001 B549; it is an offset within the file. Look at the addresses on the left side of the hex view, find the line labeled with the largest hex number not exceeding 0x1B549, and then count up to 1B549 to find where that offset actually lands. Byte 0x1B549 should be the 'x' of xref. – Patrick Gallot Jul 28 '16 at 18:20
Thanks, but I'm trying to understand all this information about hex and offsets and it's still not clear to me – kusanagi Jul 28 '16 at 19:16
can you give some code sample? – kusanagi Jul 29 '16 at 05:52
When you view a file in a hex viewer, the file is displayed in a grid. The top row has a label of '0', and the leftmost column is column 0, with column numbers increasing to the right. The second row then has a label equal to the number of columns, (usually) shown as a hex value. The offset of any given byte from the beginning of the file can be determined by adding the column number to the row label. In this way, you can visually 'seek' to specific addresses within the file. If this is new to you, then this may be a more advanced project than you may have expected from my description. – Patrick Gallot Jul 29 '16 at 20:50
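The row-label-plus-column arithmetic from the comments can be written out as a small worked example. This sketch assumes a hex view with 16 bytes per row, which is common but not universal; the function name is hypothetical.

```python
def hex_view_position(offset: int, columns: int = 16):
    """Return (row label, column) at which `offset` appears in a hex view."""
    row_index, column = divmod(offset, columns)
    return row_index * columns, column


row_label, column = hex_view_position(111945)
# The offset from the thread, then the row label in hex and the column to count to.
print(hex(111945), hex(row_label), column)
```

For the offset 111945 from the thread, this gives the row labeled 0x1b540 and column 9, so byte 0x1B549 is the tenth byte shown on that row.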

score 3 · Answer 2 · answered Aug 05 '16 at 09:07

The actual problem of the file

Having inspected the file provided by the OP, it turns out that the underlying problem is not a missing cross reference table. Instead, the file is in fact two complete PDF files concatenated: the first is 93863 bytes in size and the second 112857 bytes.

Both show the same form, the only difference being that the second one has six QR codes added at the bottom.

Probably someone attempted to merge the two PDFs (which simply doesn't work this way), or maybe it happened completely accidentally.

Thus, what the OP actually needs is a tool to split the file after 93863 bytes, right before the %PDF-1.4 file header there.
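Such a split needs no PDF library, only a byte search. The sketch below is a naive illustration: it cuts at every occurrence of the `%PDF-` header, so it would also cut inside a file that legitimately embeds another PDF in a stream. The function name and sample bytes are hypothetical.

```python
def split_concatenated_pdfs(data: bytes, header: bytes = b"%PDF-"):
    """Split `data` into chunks, each starting at an occurrence of `header`."""
    starts = []
    pos = data.find(header)
    while pos != -1:
        starts.append(pos)
        pos = data.find(header, pos + 1)
    # Each chunk runs from one header to the next, the last one to end of data.
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


first = b"%PDF-1.4\nfirst document\n%%EOF\n"
second = b"%PDF-1.4\nsecond document with QR codes\n%%EOF\n"
parts = split_concatenated_pdfs(first + second)
print(len(parts), [len(p) for p in parts])
```

For the file in question, the second chunk would be the 112857-byte PDF, whose own `startxref` offset is valid again once it stands alone.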

Why this error message

If you wonder why some program reported "Unable to find xref table": at the end of a PDF there are lines like this:

startxref
111945
%%EOF

The number indicates at which offset counted from the start of the file the cross references are located.

Thus, if a file contains two PDFs in a row, the offset stored in the second one is wrong: it was computed relative to the start of that second PDF, but a reader counts it from the very start of the combined file, so it points to a place where there are no cross references.
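With the numbers from this thread, and assuming the second PDF is internally consistent, its cross references would actually sit at 93863 + 111945 = 205808 in the combined file rather than at 111945. A small sketch of that correction follows; the function name and sample bytes are hypothetical, and the last `%PDF-` header is assumed to mark the start of the PDF that the final `startxref` belongs to.

```python
def absolute_xref_offset(data: bytes, stored_offset: int) -> int:
    """Shift a startxref value by the position of the last '%PDF-' header."""
    last_header = data.rfind(b"%PDF-")
    return last_header + stored_offset


first = b"%PDF-1.4\nfirst file\n%%EOF\n"
second_body = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
stored = len(second_body)  # offset of 'xref' relative to the second PDF
second = (
    second_body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"startxref\n" + str(stored).encode() + b"\n%%EOF\n"
)
combined = first + second

fixed = absolute_xref_offset(combined, stored)
# The stored offset misses the table; the shifted one lands on 'xref'.
print(combined[stored:stored + 4], combined[fixed:fixed + 4])
```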

Some programs in such a situation attempt to repair the file, reconstructing a cross reference table, while others fail with an error. Adobe Reader is of the former type and the program the OP tried to run of the latter.

score 0 · Answer 3 · answered Jul 22 '16 at 13:40

maybe:

http://www.verypdf.com/wordpress/201302/how-to-repair-pdfs-corrupted-xref-table-and-stream-lengths-34784.html

You could fix it yourself if you are (very!) familiar with the PDF format :) Internally, PDF is mostly text, except for the streams and embedded objects.

Honk der Hase


As far as I can see, it's a Windows tool; anyway, I need a PHP-based solution – kusanagi Jul 22 '16 at 13:45
maybe you should contact the provider of the PDF to fix this in the first place... – Honk der Hase Jul 22 '16 at 13:50
That's the last resort; first I want to try to fix the PDF – kusanagi Jul 22 '16 at 13:51

score -1 · Answer 4 · answered Jul 25 '19 at 03:14

The provider of the PDF is an HP product (device)

Mr. Q


Mr. Q, please elaborate on how this helps to solve the problem. Use the [edit] link to add more. – Yunnosch Jul 25 '19 at 06:35
