How to upload binary files in mod_perl with CGI.pm?

Question

I have a large piece of production code that works. But after I set up a new environment in a virtual machine I hit one issue: every time I upload a binary file, it gets mangled by Unicode conversions.

The problem is in this sub:

sub save_uploaded_file
{
    # $file is obtained by param(zip) 
    my ($file) = @_;
    my ($fh, $fname) = tmpnam;
    my ($br, $buffer);
    # commenting out next 2 lines doesn't help either
    binmode $file, ':raw';
    binmode $fh, ':raw';
    while ($br = sysread($file, $buffer, 16384))
    {
        syswrite($fh, $buffer, $br);
    }
    close $fh;
    return $fname;
}

It's used to upload zip archives, but they arrive malformed (their size is always bigger than the original). I looked inside them with a hex editor and found lots of Unicode replacement characters, encoded in UTF-8 (EF BF BD).

I figured out that the total number of bytes read is bigger than the size of the original file, so the problem starts at sysread.

Text files upload fine.

Update: Here is a hex dump of the first few bytes of the transferred file:

0000000: 504b 0304 1400 0000 0800 efbf bd1c efbf  PK..............
0000010: bd3e efbf bd1d 3aef bfbd efbf bd02 0000  .>....:.........
0000020: efbf bd05 0000 0500 1c00 422e 786d 6c55  ..........B.xmlU
0000030: 5409 0003 5cef bfbd efbf bd4d 18ef bfbd  T...\......M....
0000040: efbf bd4d 7578 0b00 0104 efbf bd03 0000  ...Mux..........
0000050: 0404 0000 00ef bfbd efbf bdef bfbd 6bef  ..............k.

And the original one:

0000000: 504b 0304 1400 0000 0800 b81c d33e df1d  PK...........>..
0000010: 3aa0 8102 0000 a405 0000 0500 1c00 422e  :.............B.
0000020: 786d 6c55 5409 0003 5cd4 fc4d 18c7 fc4d  xmlUT...\..M...M
0000030: 7578 0b00 0104 e803 0000 0404 0000 008d  ux..............
0000040: 94df 6bdb 3010 c7df 03f9 1f0e e1bd 254e  ..k.0.........%N
0000050: ec74 6c85 d825 2bac 9442 379a c25e ca8a  .tl..%+..B7..^..

Update 2: The running software is CentOS 5.6, Perl 5.8.8, Apache 2.2.3.

Nitpick, there is no such thing as Unicode characters in a file, Unicode must be encoded to exist in a file. Common encodings are UTF-16, UTF-8, etc. It sounds like you have UTF-8 characters. — Chas. Owens, Jun 19 '11 at 01:08
files inside zip are encoded in CP1251 (which doesn't matter, since zip-file itself is binary). Script source files are encoded in koi8-r. — kravitz, Jun 19 '11 at 02:32

score 0 · Answer 1 · answered Oct 09 '13 at 15:07

I had what I think is the same problem. The error seemed to occur very early, because none of my code ever executed when the client attempted to upload a binary file. I fixed it by setting STDIN to "raw" (binary) at the top of the script:

binmode(STDIN, ':raw') ;
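
One way to check a diagnosis like this is to ask Perl which I/O layers are attached to a handle before and after binmode. The sketch below uses the core PerlIO::get_layers function on a scratch file rather than on the real request handle. The :utf8 layer is attached on purpose to imitate a handle that was set up to decode text; that is an assumption about the failing setup, not something this thread confirms.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file holding two bytes that are not valid UTF-8 on their own.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw' or die "could not set raw mode: $!";
print {$tmp} "\xFF\xFD" or die "could not write: $!";
close $tmp or die "could not close: $!";

# Deliberately open with a decoding layer (assumed stand-in for a
# handle that something else has marked as UTF-8).
open my $in, '<:utf8', $path or die "could not open $path: $!";
my @before = PerlIO::get_layers($in);

# :raw should drop anything that transforms the byte stream.
binmode $in, ':raw' or die "could not switch to raw: $!";
my @after = PerlIO::get_layers($in);

print "before: @before\n";
print "after:  @after\n";

# With the layer gone, read() hands back the bytes untouched.
my $n = read $in, my $buf, 16;
defined $n or die "could not read: $!";
printf "read %d bytes: %s\n", $n,
    join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buf;
close $in or die "could not close: $!";
```

If the same call on the upload handle (or on STDIN) still lists utf8 after binmode, the layer is being applied somewhere the script does not control.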

score 0 · Answer 2 · answered Jun 19 '11 at 01:05

Does tmpnam return a filehandle marked as utf8? I think not!

Try binmode $fh, ":utf8";

answered Jun 19 '11 at 01:05 by cirne100

Nope, same thing. Also, I DON'T NEED UTF-8 encoding; I want to copy the bytes as-is. In fact, my issue is that everything coming from sysread gets encoded as UTF-8. – kravitz Jun 19 '11 at 01:32

Chas. Owens · Answer 3 · 2011-06-19T13:57:16.613

As far as I know, Perl 5 doesn't swap in the replacement character in any of its I/O layers. The only conversions I am aware of are newline conversions (i.e. the text layer). Are you certain the source file does not contain those byte sequences?

This code works for me, does it work for you?

#!/usr/bin/perl

use strict;
use warnings;

use File::Temp qw/:POSIX/;

sub save_uploaded_file {
    # $file is obtained by param(zip) 
    my ($file) = @_;
    my ($fh, $fname) = tmpnam;
    my ($br, $buffer);
    # commenting out next 2 lines doesn't help either
    binmode $file, ':raw'
        or die "could not change input file to raw: $!";
    binmode $fh, ':raw'
        or die "could not change tempfile to raw: $!";
    while ($br = sysread($file, $buffer, 16384)) {
        syswrite($fh, $buffer, $br);
    }
    close $fh
        or die "could not close tempfile: $!";
    return $fname;
}

sub check {
    my $input_file = shift;

    print "$input_file is ", -s $input_file, " bytes long\n"; 

    open my $fh, "<:raw", $input_file
        or die "could not open $input_file for reading: $!";

    my $bytes = sysread $fh, my $buf, 4096;

    print "read $bytes bytes: ", 
        join(", ", map { sprintf "%02x", $_ } unpack "C*", $buf),
        "\n";
}

my $input_file = "test.bin";

open my $fh, ">:raw", $input_file
    or die "could not open $input_file for writing: $!";

print $fh pack "CC", 0xFF, 0xFD
    or die "could not write to $input_file: $!";

close $fh
    or die "could not close $input_file: $!";

check $input_file;

open my $newfh, "<", $input_file
    or die "could not open $input_file: $!";
my $new_file = save_uploaded_file $newfh;

check $new_file;

Absolutely. I compared them visually in xxd: replacement characters start to occur at the 11th byte of the transferred file, while the original has none of them. I can also see that some bytes pass through unchanged (the ones in the 0-127 range). – kravitz, Jun 19 '11 at 01:37
What characters are getting replaced by `"\x{FFFD}"`? Or is `"\x{FFFD}"` being inserted? – Chas. Owens, Jun 19 '11 at 01:58
I created a file with these two bytes; they were transferred as "\x{efbfbdefbfbd}". – kravitz, Jun 19 '11 at 02:31
If I place it as a scalar directly into $buffer, then syswrite complains about wide characters. If I put "\x{FF}\x{FD}" instead, it goes to the file as-is (FF FD). – kravitz, Jun 19 '11 at 02:38
`"\x{FF}\x{FD}"` is not the same thing as `"\x{FFFD}"`. The first is LATIN SMALL LETTER Y WITH DIAERESIS followed by LATIN SMALL LETTER Y WITH ACUTE. The second is REPLACEMENT CHARACTER. The `"\x{}"` form uses Unicode code points. There is no direct relation between code points and bytes; you must know the encoding used to write the code points. – Chas. Owens, Jun 19 '11 at 13:22
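
The distinction between code points and bytes in the last comment can be checked directly. This is a minimal sketch using the core Encode module; the hex_bytes helper is a name made up for the illustration and is not part of any library.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }

my $two_chars   = "\x{FF}\x{FD}";  # two code points: U+00FF and U+00FD
my $replacement = "\x{FFFD}";      # one code point:  U+FFFD

printf "characters in first string:  %d\n", length $two_chars;
printf "characters in second string: %d\n", length $replacement;

# The encoding decides which bytes end up in a file.
my $latin1 = encode('ISO-8859-1', $two_chars);
my $utf8_a = encode('UTF-8',      $two_chars);
my $utf8_b = encode('UTF-8',      $replacement);

print "U+00FF U+00FD as Latin-1: ", hex_bytes($latin1), "\n";
print "U+00FF U+00FD as UTF-8:   ", hex_bytes($utf8_a), "\n";
print "U+FFFD as UTF-8:          ", hex_bytes($utf8_b), "\n";
```

Only the last line produces the ef bf bd sequence seen in the transferred file, which is why the test file's FF FD bytes showing up as efbfbdefbfbd points at a decode step and not at a plain byte copy.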

cirne100 · Answer 4 · 2011-06-19T03:37:01.140

sysread is reading the file as UTF-8, but the file is not UTF-8! The first ten bytes are in the Basic Latin range (00-7F), so they are interpreted as the same bytes. The next byte, b8, is not valid at that position, so it is replaced by ef bf bd, which is the UTF-8 encoding of \x{FFFD} (a special character that indicates a decoding error). Every byte greater than 7F is being replaced by \x{FFFD}.

What Perl version and OS are you using? There is a report (perl bug 75106) titled: binmode $fh, ":raw" doesn't undo :utf8 on win32!
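
The effect described above can be imitated on a few bytes from the question's hex dump. The sketch below decodes them leniently as UTF-8 with the core Encode module and encodes the result again. It only reproduces the symptom; whether Encode or some other layer does the conversion in the asker's setup is not known from this thread, and hex_bytes is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }

# Bytes at offsets 0x08..0x13 of the original archive, taken from the dump.
my $original = pack 'H*', '0800b81cd33edf1d3aa08102';

# Lenient decode: each byte that is not valid UTF-8 here becomes U+FFFD.
my $copy = $original;
my $text = decode('UTF-8', $copy);

# Writing the text back out as UTF-8 turns each U+FFFD into ef bf bd.
my $mangled = encode('UTF-8', $text);

# Count the bytes above 0x7f in the original.
my $high = () = $original =~ /[\x80-\xff]/g;

print "original: ", hex_bytes($original), "\n";
print "mangled:  ", hex_bytes($mangled),  "\n";
printf "%d high bytes, %d -> %d bytes\n",
    $high, length $original, length $mangled;
```

Each high byte turns into three bytes, so the output grows by two bytes per byte above 7F, which fits the observation that the uploaded archive is always larger than the original.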

edited Jun 19 '11 at 03:37

answered Jun 19 '11 at 02:44

Encoding is a term that applies to text files; this one is binary, and ":raw" should in fact help. But as I mentioned in the question, it doesn't. – kravitz Jun 19 '11 at 02:49
centos 5.6, perl 5.8.8, apache 2.2.3 – kravitz Jun 19 '11 at 04:20
@kravitz - Did you try the code with an updated perl version? – cirne100 Jun 20 '11 at 00:01
The problem is that I have no privileges to update it on the main server, so I need to make it work on this version. I have a feeling that the problem may be an improper mod_perl configuration, since I didn't see it on the main machine. – kravitz Jun 20 '11 at 01:11