
I want to get the downloaded content (a buffer) and, soon after, store the data in my S3 account. So far I haven't been able to find a solution. Looking for examples on the web, I noticed that a lot of people have this problem. I tried (unsuccessfully) to use the page.on("response") event to retrieve the raw response content, according to the following snippet:

const bucket = [];
await page.on("response", async response => {
  const url = response.url();
  if (
    url ===
    "https://the.earth.li/~sgtatham/putty/0.71/w32/putty-0.71-installer.msi"
  ) {
    try {
      if (response.status() === 200) {
        bucket.push(await response.buffer());
        console.log(bucket);
        // I got the following: 'Protocol error (Network.getResponseBody): No resource with given identifier found'
      }
    } catch (err) {
      console.error(err, "ERROR");
    }
  }
});

With the code above, my intent was to detect the download dialog event and then, in some way, receive the binary content.

I'm not sure that's the correct approach. I noticed that some people use a solution based on reading files; in other words, after the download finishes, they read the stored file from disk. There is a similar discussion at: https://github.com/GoogleChrome/puppeteer/issues/299.

My question is: is there some way (using Puppeteer) to intercept the download stream without having to save the file to disk first?

Thank you very much.

Rogério Oliveira

1 Answer


The problem is that the buffer is cleared as soon as any kind of navigation request happens. In your case this might be a redirect or a page reload.

To solve this problem, you need to make sure that the page does not make any navigation requests until you have finished downloading your resource. To do this we can use page.setRequestInterception.

There is a simple solution, which might get you started but might not always work, and a more complex solution to this problem.

Simple solution

This solution cancels any navigation requests after the initial request. This means any reload or navigation on the page will not work, so the buffers of the resources are not cleared.

const browser = await puppeteer.launch();
const [page] = await browser.pages();

let initialRequest = true;
await page.setRequestInterception(true);

page.on('request', request => {
    // cancel any navigation requests after the initial page.goto
    if (request.isNavigationRequest() && !initialRequest) {
        return request.abort();
    }
    initialRequest = false;
    request.continue();
});

page.on('response', async (response) => {
    if (response.url() === 'RESOURCE YOU WANT TO DOWNLOAD') {
        const buffer = await response.buffer();
        // handle buffer
    }
});

await page.goto('...');
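
The // handle buffer part depends on what you want to do with the data. Since the question mentions storing the file in S3, here is a minimal sketch of that step, assuming the aws-sdk v2 package, a hypothetical bucket name and object key, and AWS credentials already configured in the environment:

// Sketch: upload the downloaded buffer to S3 (aws-sdk v2).
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function uploadBuffer(buffer) {
    await s3.putObject({
        Bucket: 'my-bucket',              // hypothetical bucket name
        Key: 'putty-0.71-installer.msi',  // hypothetical object key
        Body: buffer,
        ContentType: 'application/octet-stream'
    }).promise();
}

You would call uploadBuffer(buffer) in place of the // handle buffer comment.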

Advanced solution

The following code processes the requests one after another. If you download a buffer, it waits until the download has finished before processing the next request.

const browser = await puppeteer.launch();
const [page] = await browser.pages();

let paused = false;
let pausedRequests = [];

const nextRequest = () => { // continue the next request or "unpause"
    if (pausedRequests.length === 0) {
        paused = false;
    } else {
        // continue first request in "queue"
        (pausedRequests.shift())(); // calls the request.continue function
    }
};

await page.setRequestInterception(true);
page.on('request', request => {
    if (paused) {
        pausedRequests.push(() => request.continue());
    } else {
        paused = true; // pause, as we are processing a request now
        request.continue();
    }
});

page.on('requestfinished', async (request) => {
    const response = await request.response();
    if (response.url() === 'RESOURCE YOU WANT TO DOWNLOAD') {
        const buffer = await response.buffer();
        // handle buffer
    }
    nextRequest(); // continue with next request
});
page.on('requestfailed', nextRequest);

await page.goto('...');
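
Both snippets are event-driven, so the surrounding code cannot directly await the buffer. If you need the buffer back in your main flow (for example to upload it afterwards), you can wrap the response handler in a promise. A minimal sketch, assuming request interception has already been set up as shown above and targetUrl is the resource you want:

// Sketch: resolve a promise once the buffer of the target response is available,
// so calling code can simply await it.
function waitForDownload(page, targetUrl, timeout = 30000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => reject(new Error('Download timed out')), timeout);
        page.on('response', async (response) => {
            if (response.url() === targetUrl) {
                try {
                    clearTimeout(timer);
                    resolve(await response.buffer());
                } catch (err) {
                    reject(err);
                }
            }
        });
    });
}

// Usage:
// const bufferPromise = waitForDownload(page, 'RESOURCE YOU WANT TO DOWNLOAD');
// await page.goto('...');
// const buffer = await bufferPromise;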
Thomas Dondorf
  • Thomas, first of all I would like to thank you for your help. However, I discovered that there is a serious bug in Puppeteer that can get in my way. When we set page.setRequestInterception(true), Chromium becomes unable to resolve the pages; the browser stays forever on about:blank. That issue is shown here: https://github.com/GoogleChrome/puppeteer/issues/3118 As long as this bug isn't solved, the original problem described here will exist. – Rogério Oliveira Mar 29 '19 at 21:14
  • For now, I'm working around it by reading the file from memory (a virtual disk) after the download is done. Unfortunately I'm using the traditional Node FS approach. Have you tried using Chromium with setRequestInterception set to true? I'm sorry, but my English is a little weak. Don't worry if you find some mistakes in my text. :/ – Rogério Oliveira Mar 29 '19 at 21:14
    Didn't know of these [request interception problems](https://github.com/GoogleChrome/puppeteer/issues/3471). You could also try to download the request a second time with `https.get` by using the cookies and headers from the request. But this will trigger a second download. – Thomas Dondorf Mar 30 '19 at 07:55
  • I'm going to follow your suggestion, Thomas, and evaluate what is the best way for me. Thank you. – Rogério Oliveira Apr 01 '19 at 11:52
  • This solution did not help me; I still received an error on the await in const buffer = await res.buffer(); How did you resolve it? – Oleg Zinchenko Sep 11 '20 at 14:50
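
For completeness, here is a minimal sketch of the re-download approach mentioned in the comments above: fetch the resource a second time with Node's https module, reusing the intercepted request's headers and the page's cookies. The helper name is hypothetical, it assumes an https URL, and it does trigger a second download of the file:

const https = require('https');

// Hypothetical helper: re-fetch an intercepted request outside the browser,
// reusing its headers and the page's cookies for that URL.
async function refetchAsBuffer(page, request) {
    const url = request.url();
    const cookies = await page.cookies(url);
    const headers = {
        ...request.headers(),
        cookie: cookies.map(c => `${c.name}=${c.value}`).join('; ')
    };
    return new Promise((resolve, reject) => {
        https.get(url, { headers }, (res) => {
            const chunks = [];
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('end', () => resolve(Buffer.concat(chunks)));
        }).on('error', reject);
    });
}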