
I would like to write a Chrome extension that downloads a PDF file from a website that accepts POST requests and uploads the PDF to my localhost server. Here's my attempt:

  $.ajax({
        url: 'http://example.com/download.action',
        data: data,
        type: 'POST',
        cache: false,
        crossDomain: true,
        success: function(response) {
            $.ajax({
                url: 'http://localhost/getpdf.php',
                data: response,
                type: 'POST',
                cache: false,
                contentType: 'application/octet-stream',
                processData: false,
                crossDomain: true
            });
        }
    });

In the console I can see the response to the download request: it is binary content beginning with "%PDF-1.7.%...", which seems reasonable. On the localhost server side, I use some simple PHP code to save the PDF file:

<?php
$raw_data = file_get_contents('php://input');
$f = fopen('test.pdf', 'w');
fwrite($f, $raw_data);
fclose($f);
?>

The file is saved, but Adobe Reader can't open it (it reports the file as damaged), and it is about twice the size of the original.

I compared the saved PDF with the original in vim -b; here are the first 10 lines of each:

The original one:

0000000: 2550 4446 2d31 2e37 0a25 e4e3 cfd2 0a36  %PDF-1.7.%.....6
0000010: 2030 206f 626a 0a3c 3c2f 5479 7065 2f58   0 obj.<</Type/X
0000020: 4f62 6a65 6374 0a2f 5375 6274 7970 652f  Object./Subtype/
0000030: 466f 726d 0a2f 4242 6f78 5b30 2030 2035  Form./BBox[0 0 5
0000040: 3935 2e32 3736 2038 3431 2e38 395d 0a2f  95.276 841.89]./
0000050: 5265 736f 7572 6365 733c 3c2f 584f 626a  Resources<</XObj
0000060: 6563 743c 3c2f 496d 3020 3720 3020 522f  ect<</Im0 7 0 R/
0000070: 496d 3120 3820 3020 522f 496d 3220 3920  Im1 8 0 R/Im2 9 
0000080: 3020 523e 3e2f 436f 6c6f 7253 7061 6365  0 R>>/ColorSpace
0000090: 3c3c 2f43 5330 2031 3020 3020 522f 4353  <</CS0 10 0 R/CS

The saved one:

0000000: 2550 4446 2d31 2e37 0a25 efbf bdef bfbd  %PDF-1.7.%......
0000010: efbf bdef bfbd 0a36 2030 206f 626a 0a3c  .......6 0 obj.<
0000020: 3c2f 5479 7065 2f58 4f62 6a65 6374 0a2f  </Type/XObject./
0000030: 5375 6274 7970 652f 466f 726d 0a2f 4242  Subtype/Form./BB
0000040: 6f78 5b30 2030 2035 3935 2e32 3736 2038  ox[0 0 595.276 8
0000050: 3431 2e38 395d 0a2f 5265 736f 7572 6365  41.89]./Resource
0000060: 733c 3c2f 584f 626a 6563 743c 3c2f 496d  s<</XObject<</Im
0000070: 3020 3720 3020 522f 496d 3120 3820 3020  0 7 0 R/Im1 8 0 
0000080: 522f 496d 3220 3920 3020 523e 3e2f 436f  R/Im2 9 0 R>>/Co
0000090: 6c6f 7253 7061 6365 3c3c 2f43 5330 2031  lorSpace<</CS0 1

It seems some bytes were changed (maybe a charset problem?).

Any hints about this?

Logan Ding
  • *maybe charset problem* - yes, bytes > 127 are replaced by the [Unicode Character 'REPLACEMENT CHARACTER'](http://www.fileformat.info/info/unicode/char/0fffd/index.htm), efbfbd in UTF-8. A PDF is binary data, though, so no charset-related replacements should happen at all. Try adding `b` (binary) to the `fopen` mode. – mkl Aug 04 '14 at 06:52
  • @mkl Thanks, I tried adding `b` but the problem remains. Maybe the problem is on the JavaScript side? – Logan Ding Aug 04 '14 at 07:35
  • *Maybe the problem is on the JavaScript side* - probably yes. Your hex dumps definitely look like some step in between treated the PDF as ASCII text (thus replacing every byte >127). Which step it is, I don't know. – mkl Aug 04 '14 at 07:57

2 Answers

You may want to use the approach from the following question to read the PDF text and then write it out as your new PDF file:

how to get text from pdf file and save it into DB

saad

Finally I found a solution that meets my requirements.

As mentioned by @mkl, somewhere along the way bytes of the original PDF binary data were replaced with the UTF-8 replacement character, but we didn't know at which step this happened. So I started searching for ways to send and receive binary data instead of strings, and I found this, which introduced the XMLHttpRequest response type "arraybuffer".
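The replacement visible in the question's hex dumps can be reproduced in isolation. The following is a minimal sketch, assuming a runtime that provides `TextDecoder` and `TextEncoder` (current browsers and Node.js do); the helper `toHex` is made up for this illustration. It takes the four bytes that follow `%PDF-1.7` in the original file, decodes them as UTF-8 text, and encodes the text again, which is most likely what happens when the response is handled as a string:

```javascript
// The four bytes after "%PDF-1.7\n%" in the original dump.
const original = new Uint8Array([0xe4, 0xe3, 0xcf, 0xd2]);

// Treating the bytes as UTF-8 text: none of them forms a valid sequence here,
// so the decoder substitutes U+FFFD (REPLACEMENT CHARACTER) for each one.
const asText = new TextDecoder('utf-8').decode(original);

// Sending that text again encodes every U+FFFD as the three bytes ef bf bd.
const roundTripped = new TextEncoder().encode(asText);

// Hypothetical helper: format bytes the way the hex dumps show them.
function toHex(bytes) {
    return Array.from(bytes, function (b) {
        return b.toString(16).padStart(2, '0');
    }).join(' ');
}

console.log(toHex(original));     // e4 e3 cf d2
console.log(toHex(roundTripped)); // ef bf bd ef bf bd ef bf bd ef bf bd
```

Each invalid byte grows to three bytes, which matches the `efbf bdef bfbd` run in the saved file and is consistent with the saved file being roughly twice as large as the original.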

Following that article, I changed my JavaScript to the following, and it works:

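// Build the form-encoded request body from the `data` object.
var form = $('<form method="post"></form>');
for (var i in data) {
    // Set attributes through jQuery so quotes in names or values cannot break the markup.
    form.append($('<input>').attr({ name: i, value: data[i] }));
}

data = form.serialize();

var oReq = new XMLHttpRequest();
oReq.open('POST', 'http://example.com/download.action', true);
oReq.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
// Ask for the raw bytes; the default text response would be decoded into a string.
oReq.responseType = 'arraybuffer';

oReq.onload = function (oEvent) {
    var arrayBuffer = oReq.response;
    if (arrayBuffer) {
        // Forward the bytes unchanged to the local server (asynchronously).
        var xhr = new XMLHttpRequest();
        xhr.open('POST', 'http://localhost/getpdf.php', true);
        xhr.setRequestHeader('Content-Type', 'application/octet-stream');
        xhr.send(arrayBuffer);
    }
};

oReq.send(data);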
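
Before uploading, it may be worth checking that the downloaded buffer really starts with the `%PDF-` signature seen in the console, since a download endpoint could also answer with something else, such as an error page. This is only a sketch; `looksLikePdf` is a hypothetical helper and not part of any library:

```javascript
// Hypothetical helper: true if the buffer begins with the bytes of "%PDF-".
function looksLikePdf(arrayBuffer) {
    var magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
    if (!arrayBuffer || arrayBuffer.byteLength < magic.length) {
        return false;
    }
    // View only the first five bytes; no data is copied.
    var head = new Uint8Array(arrayBuffer, 0, magic.length);
    for (var i = 0; i < magic.length; i++) {
        if (head[i] !== magic[i]) {
            return false;
        }
    }
    return true;
}
```

In the `onload` handler above, the condition `if (arrayBuffer)` could then become `if (looksLikePdf(arrayBuffer))`.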
Logan Ding