Writing PDF binary file from stream yields malformed PDF

Question

Dear Stack Overflow users,

I would appreciate your kind help with the following problem. We have an Apache server functioning as a forward proxy, with ext_filter configured: whenever the response has a PDF MIME type, the filter (a Perl script) is called, and the PDF's content can be read from STDIN. We read the PDF from STDIN and write it to a file, nothing more. This almost always works well, but for one specific website the PDF is malformed when written in the following way:

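my $input_file = shift;
binmode STDIN;
# Three-argument open with a lexical handle; fail loudly if the file cannot be created.
open(my $out, '>', $input_file) or die "Cannot open $input_file: $!";
binmode $out;
# Copy STDIN to the file line by line.
foreach my $line (<STDIN>) {
    print {$out} $line;
}
close($out) or die "Cannot close $input_file: $!";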

If we instead set the filter to use 'tee', the file is written correctly. Analyzing the malformed PDF shows that its xref table is broken, and Adobe Reader fails to open it. We have already tried sysopen, sysread, etc., the ":raw" layer, and several other ways of writing a binary file properly (code cut and pasted from the documentation for writing binary files), and nothing worked. The file was written correctly only when the Linux 'tee' utility was used as the filter. This doesn't help us: we need to be able to write the file from STDIN as part of the Perl script. Any suggestions? If there were a way to call 'tee' with a system call and give it the STDIN of the Perl program, that might work. Many thanks in advance.
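One way to hand the script's STDIN to tee, as asked above, might be the list form of Perl's system(), since a child process inherits the parent's standard handles. The following is only a minimal sketch: it assumes tee is on the PATH and that nothing has been read from STDIN in Perl before the call, and the helper name tee_stdin_to is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: let tee(1) read this process's STDIN directly and
# write it to $path. Nothing may have been read from STDIN in Perl before
# this call, otherwise bytes already sitting in Perl's buffer are lost.
sub tee_stdin_to {
    my ($path) = @_;

    # The list form of system() bypasses the shell; the child inherits
    # file descriptors 0 (STDIN) and 1 (STDOUT) from this process.
    my $rc = system('tee', $path);

    die "tee could not be started: $!" if $rc == -1;
    die sprintf("tee exited with status %d", $rc >> 8) if $rc != 0;
    return 1;
}
```

Calling tee_stdin_to(shift @ARGV) as the first action of the filter script would then write the file. Note that tee also copies its input to STDOUT, so the filter's output stream still receives the PDF.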

You copy from STDIN to your result file in a *line-by-line* manner. PDFs (even though they may in large parts look like text files) have to be treated as binary files. If e.g. end-of-line characters of one type are replaced by another one (CR LF by LF), the position of any following object changes and, thus, the cross references are suddenly wrong. Furthermore binary partial streams are included in PDFs, e.g. fonts or compressed text streams. If any replacements take place there, those streams are rendered unreadable. — mkl, Jun 26 '13 at 12:40
This works for me with a couple of PDFs. Even for a 1 GB DVD ISO image it produced a copy that was completely identical. So I assume that your code is OK. Is this one site doing something special? Content-Type, KeepAlive, Deflate, etc.? Btw: I don't even want to know why you do this. — innaM, Jun 26 '13 at 13:53
Thanks, @innaM. There must be something special, as Ghostscript also had a problem opening it (although Adobe Reader can open it before we ruin it...). I'll reanalyze the PDF for special elements and update afterwards. — user2522941, Jun 27 '13 at 04:47
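The byte-for-byte copying that mkl's comment calls for can be sketched as a chunked read loop with no line splitting. This is only an illustration under stated assumptions: the helper name copy_binary and the 64 KiB chunk size are arbitrary choices, not something taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: copy everything from $in to $out in fixed-size
# chunks, with no line splitting and no encoding layers.
# Returns the number of bytes copied.
sub copy_binary {
    my ($in, $out) = @_;
    binmode $in;
    binmode $out;

    my $total = 0;
    while (1) {
        # read() returns the byte count, 0 at end of file, undef on error.
        my $n = read($in, my $buf, 65536);
        die "read failed: $!" unless defined $n;
        last if $n == 0;
        print {$out} $buf or die "write failed: $!";
        $total += $n;
    }
    return $total;
}
```

In the filter script this could be used as open(my $out, '>', $input_file) or die ...; copy_binary(\*STDIN, $out); close($out) or die ...;. Checking the return values of open, read, print and close makes a truncated or failed write visible instead of silently producing a damaged file.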

score 0 · Answer 1 · answered Aug 18 '13 at 12:13

Well, although the code was basically correct, putting it inside an "eval" somehow ruined the PDF. I still don't understand why, but deleting the eval solved the problem.

The Perl script is called from the context of Apache's ext_filter module.

I'll investigate this further and update when I find an explanation.

Thanks, everyone.
