I am trying to download a PDF from a website.

The website is built with the ZK framework, and it reveals a dynamic URL to the PDF for a limited window of time when an ID number is typed into an input bar. This step is easy enough, and I am able to get the PDF URL, which opens in the browser in an embedded tag.

However, it has been impossible for me to find a way to download the file to my computer. For days, I have tried and read everything from this, to this, to this.

The closest I have been able to get is with this code:

let [ iframe ] = await page.$x('//iframe');
let pdf_url = await page.evaluate( iframe => iframe.src, iframe);

let res = await page.evaluate( async url =>
    await fetch(url, {
        method: 'GET',
        credentials: 'same-origin', // useful when we are logged into a website and want to send cookies
        responseType: 'arraybuffer', // get response as an ArrayBuffer
    }).then(response => response.text()),
    pdf_url
);
console.log('res:', res);
//const response = await page.goto(pdf);
fs.writeFileSync('somepdf.pdf', res);

This results in a blank PDF file that is 92K in size, while the file I am trying to get is 52K. I suspect the back end might be sending me a 'dummy' PDF file because the headers on my fetch request might not be correct.

What else can I try?

Here is the link to the PDF page.

You can use the random ID number I found: '1705120630'

  • @KJ, thank you so much for replying! The 92KB file I get is all blank when I open it with Atril and/or Firefox. You can find it [here](https://github.com/GoranTopic/puppeteer_playground/blob/master/somepdf.pdf) in my GitHub repo. – Goran Topic's Bot May 02 '22 at 04:09
  • I suspect this is a dummy file the back end sends when something is not quite right with the request, but I don't know what could be wrong with the request `fetch` makes other than the headers, which are not specified. Or maybe it only allows one request to be made before sending the dummy file. I don't really have any other explanation for it. – Goran Topic's Bot May 02 '22 at 04:12
  • hey @KJ, any luck? – Goran Topic's Bot May 02 '22 at 05:33
  • I have been reading online that people get this problem when they use the wrong encoding to download the PDF. I have been trying 'base64' in the `fetch` options and in `fs.writeFileSync`, but it downloads as a data file instead of a PDF. – Goran Topic's Bot May 02 '22 at 16:17
  • I saw that the PDF file has `/Filter /CCITTFaxDecode /DecodeParms << /K -1 /Columns 2496 /BlackIs1 true >>` for each page. Do you think that could be the missing 'encoding' I have to do? – Goran Topic's Bot May 02 '22 at 16:20
  • Thank you again @k J, for the explanation and for bearing with me. From my understanding so far, the transfer encoding could be 'base64.text' or binary.pdf, but regardless, since I am getting the start of the file correctly, the transfer encoding should be correct? Do you have any idea where the problem could be located? – Goran Topic's Bot May 02 '22 at 17:23
  • Whoa, this makes sense. When I call `.then(res => res.text())`, Puppeteer must be serializing the binary into text to get it from the Chrome browser to the Node.js instance, where it is saved as such. I guess something I could try would be to de-serialize it before saving it, or to write to the file system directly from the browser. – Goran Topic's Bot May 02 '22 at 17:44
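
The idea in the last comment can be sketched roughly as follows. `response.text()` decodes the body as UTF-8, and every byte that is not valid UTF-8 is replaced with U+FFFD, which takes three bytes when written back out. That would be consistent with a 52K file growing to 92K and rendering blank, although it has not been checked against this particular site. Also note that `responseType` is not a `fetch` option, so it is ignored. Since `page.evaluate` can only return serializable values, one option is to read the body with `arrayBuffer()`, encode it as base64 inside the page, and decode it in Node. The helper names `fetchAsBase64` and `saveBase64` are made up for this sketch; `page` and `pdf_url` are the variables from the question.

```javascript
const fs = require('fs');

// Browser-side function, meant to be passed to page.evaluate().
// It has to be self-contained because Puppeteer serializes it into the page.
// (Hypothetical helper name, not part of Puppeteer.)
async function fetchAsBase64(url) {
  const response = await fetch(url, { method: 'GET', credentials: 'same-origin' });
  const bytes = new Uint8Array(await response.arrayBuffer());
  let binary = '';
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary); // a plain string survives the browser -> Node transfer
}

// Node-side helper: decode the base64 string back into the original bytes.
// (Hypothetical helper name.)
function saveBase64(filePath, base64) {
  fs.writeFileSync(filePath, Buffer.from(base64, 'base64'));
}

// Local demonstration (no browser needed) of why decoding as text corrupts binary data.
// "%PDF-" followed by four bytes that are not valid UTF-8.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0xff, 0xfe, 0x80, 0x81]);
const asText = Buffer.from(original.toString('utf8'), 'utf8');         // lossy
const roundTrip = Buffer.from(original.toString('base64'), 'base64');  // lossless

console.log('original bytes:  ', original.length);
console.log('after text trip: ', asText.length);
console.log('base64 identical:', roundTrip.equals(original));

// Intended use with Puppeteer (assumes `page` and `pdf_url` from the question):
//   const base64 = await page.evaluate(fetchAsBase64, pdf_url);
//   saveBase64('somepdf.pdf', base64);
```

The ASCII header survives the text round trip unchanged, which may be why the start of the file looked correct even though the binary streams were damaged.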
