38

Please tell me how to properly use a proxy with Puppeteer and headless Chrome. My attempt does not work.

const puppeteer = require('puppeteer');
(async () => {
  const argv = require('minimist')(process.argv.slice(2));

  const browser = await puppeteer.launch({args: ["--proxy-server =${argv.proxy}","--no-sandbox", "--disable-setuid-sandbox"]});
  const page = await browser.newPage();

  await page.setJavaScriptEnabled(false);
  await page.setUserAgent(argv.agent);
  await page.setDefaultNavigationTimeout(20000);
  try{
  await page.goto(argv.page);

  const bodyHTML = await page.evaluate(() => new XMLSerializer().serializeToString(document))
  body = bodyHTML.replace(/\r|\n/g, '');
  console.log(body);
}catch(e){
        console.log(e);
}
  await browser.close();
})();
Matthew Schuchard
Irina Kazhamiakina

8 Answers

64

You can find a proxy example here:

'use strict';

const puppeteer = require('puppeteer');

(async() => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    //    https://www.chromium.org/developers/design-documents/network-settings
    args: [ '--proxy-server=127.0.0.1:9876' ]
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();
Chuong Tran
22

It's possible with puppeteer-page-proxy. It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request. And yes, it works both in headless and headful Chrome.

First install it:

npm i puppeteer-page-proxy

Then require it:

const useProxy = require('puppeteer-page-proxy');

Using it is easy. To set a proxy for an entire page:

await useProxy(page, 'http://127.0.0.1:8000');

If you want a different proxy for each request, you can simply do this:

await page.setRequestInterception(true);
page.on('request', req => {
    useProxy(req, 'socks5://127.0.0.1:9000');
});

Then, if you want to be sure that your page's IP has changed, you can look it up:

const data = await useProxy.lookup(page);
console.log(data.ip);

It supports http, https, socks4 and socks5 proxies, and it also supports authentication if that is needed:

const proxy = 'http://login:pass@127.0.0.1:8000'

Repository: https://github.com/Cuadrix/puppeteer-page-proxy

Cuadrix
  • How can you use setRequestInterception to block images/CSS/fonts and still use useProxy at page level? Calling useProxy for each request is too slow. – sparkle Apr 11 '20 at 16:23
  • Note: this library uses nodejs-got to download the page, so it bypasses Chrome's downloader and makes an extra request to get a page. It won't work for many use cases. – Pawel Miech May 10 '23 at 12:15
21

Do not use

"--proxy-server =${argv.proxy}"

This is an ordinary string, not a template literal, so ${argv.proxy} is never substituted. Use backticks (`) instead of double quotes ("), and drop the space before the equals sign so that Chrome sees a single --proxy-server=... flag:

`--proxy-server=${argv.proxy}`

Print this string before you pass it to the launch function to make sure it is correct. You may also want to visit http://api.ipify.org/ in that browser to confirm that the proxy works.
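As a minimal sketch of that check, the flag can be built in one small function and the outgoing IP read back from api.ipify.org. The names buildProxyArg and checkProxy are hypothetical helpers, not part of Puppeteer, and the sketch assumes the puppeteer package is installed when checkProxy is actually called.

```javascript
// Hypothetical helper: builds the Chrome flag from a proxy address.
function buildProxyArg(proxy) {
  if (typeof proxy !== 'string' || proxy.trim() === '') {
    throw new Error('A proxy address such as "127.0.0.1:9876" is required');
  }
  // Backticks make this a template literal; there is no space before "=".
  return `--proxy-server=${proxy.trim()}`;
}

// Hypothetical helper: launches Chrome through the proxy and returns the
// text served by api.ipify.org, which is the IP address the site sees.
async function checkProxy(proxy) {
  const puppeteer = require('puppeteer'); // loaded only when the check runs
  const browser = await puppeteer.launch({
    args: [buildProxyArg(proxy), '--no-sandbox', '--disable-setuid-sandbox'],
  });
  try {
    const page = await browser.newPage();
    await page.goto('http://api.ipify.org/');
    return await page.evaluate(() => document.body.innerText.trim());
  } finally {
    await browser.close();
  }
}
```

Calling checkProxy(argv.proxy) and comparing the result with your own IP shows whether traffic really goes through the proxy.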

Manas Khandelwal
qrt
4

I see https://github.com/Cuadrix/puppeteer-page-proxy and https://github.com/gajus/puppeteer-proxy recommended above, and I want to emphasize that these two packages do not actually use the Chrome instance to perform the network request. Here is what they do instead:

  1. When user code initiates a network request in Puppeteer, e.g. by calling page.goto(), the proxy package intercepts the outgoing HTTP request and pauses it.
  2. The proxy package passes the request to another network library (Got).
  3. Got performs the actual network request through the specified proxy.
  4. Got then has to pass all the response data back to Puppeteer. This means the proxy package has to manage a number of extra details, such as converting cookies from the raw HTTP Set-Cookie format to Puppeteer's format.

While this may be a viable approach in many cases, you need to understand that it changes the TLS fingerprint of your HTTP request, so the request might get blocked by some websites, particularly those using Cloudflare bot detection (because the website now sees that your request originates from Node.js, not from Chrome).
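The interception flow in the numbered steps can be sketched roughly as follows. This is not the actual source of either package: fulfilFromNode and enableNodeSideProxy are hypothetical names, and fetchViaProxy is an assumed function (url, options) returning a promise of { status, headers, body } that would wrap an HTTP client such as Got configured with a proxy agent.

```javascript
// Hypothetical handler: answers an intercepted Puppeteer request from Node.js
// instead of letting Chrome perform it over the network.
async function fulfilFromNode(request, fetchViaProxy) {
  try {
    // Steps 2 and 3: hand the paused request to the Node-side HTTP client.
    const response = await fetchViaProxy(request.url(), {
      method: request.method(),
      headers: request.headers(),
      body: request.postData(),
    });
    // Step 4: pass the response data back to the browser.
    await request.respond({
      status: response.status,
      headers: response.headers,
      body: response.body,
    });
  } catch (err) {
    // Tell the browser that the request failed.
    await request.abort('failed');
  }
}

// Step 1: intercept and pause every request made by the page.
async function enableNodeSideProxy(page, fetchViaProxy) {
  await page.setRequestInterception(true);
  page.on('request', (request) => fulfilFromNode(request, fetchViaProxy));
}
```

Because the bytes on the wire come from the Node.js client, the TLS handshake is Node's rather than Chrome's, which is the fingerprint difference described above.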

An alternative method of setting a proxy in Puppeteer.

Chrome launch args are fine if you want to use one proxy for all websites. But what if you want a single Chrome instance to use multiple proxies without the two packages mentioned above?

Puppeteer's createIncognitoBrowserContext function to the rescue. The proxy is passed as the proxyServer option, and newer Puppeteer releases name this function createBrowserContext:

// Create a new incognito browser context
const context = await browser.createIncognitoBrowserContext({ proxyServer: 'http://localhost:2022' });
// Create a new page inside context.
const page = await context.newPage();

// authenticate in proxy using basic browser auth
await page.authenticate({username:user, password:password});
// ... do stuff with page ...
await page.goto('https://example.com');
// Dispose context once it's no longer needed.
await context.close();
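To use several proxies from one browser, the same pattern can be wrapped in a helper that opens one context per proxy. This is a sketch under assumptions: splitProxyUrl and newPageViaProxy are hypothetical names, the context option is assumed to be proxyServer, and the helper falls back to createIncognitoBrowserContext when createBrowserContext is not available in the installed Puppeteer version.

```javascript
// Hypothetical helper: separates credentials from a proxy URL such as
// 'http://login:pass@127.0.0.1:8000'.
function splitProxyUrl(proxyUrl) {
  const url = new URL(proxyUrl);
  const server = `${url.protocol}//${url.host}`;
  const credentials = url.username
    ? {
        username: decodeURIComponent(url.username),
        password: decodeURIComponent(url.password),
      }
    : null;
  return { server, credentials };
}

// Hypothetical helper: opens a page in its own browser context that is
// routed through the given proxy.
async function newPageViaProxy(browser, proxyUrl) {
  const { server, credentials } = splitProxyUrl(proxyUrl);
  // Use whichever context-creation method this Puppeteer version provides.
  const create =
    browser.createBrowserContext || browser.createIncognitoBrowserContext;
  const context = await create.call(browser, { proxyServer: server });
  const page = await context.newPage();
  if (credentials) {
    // Answer the proxy's basic-auth challenge.
    await page.authenticate(credentials);
  }
  return { context, page };
}
```

Each call returns its own context, so two pages opened with different proxy URLs do not share cookies or proxy settings; close each context when you are done with it.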

proxy-chain package

If your proxy requires auth and you don't like the page.authenticate call, you can set the proxy using the proxy-chain npm package.

proxy-chain launches an intermediate proxy on your localhost, which enables some nice things. Read more on the technical details of the proxy-chain implementation: https://pixeljets.com/blog/how-to-set-proxy-in-puppeteer

Anthony S
2

If you want to use a different proxy for each page, try this: use https-proxy-agent or http-proxy-agent to proxy the requests of each page.

2

You can use https://github.com/gajus/puppeteer-proxy to set a proxy either for an entire page or for specific requests only, e.g.

import puppeteer from 'puppeteer';
import {
  createPageProxy,
} from 'puppeteer-proxy';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const pageProxy = createPageProxy({
    page,
    proxyUrl: 'http://127.0.0.1:3000',
  });

  await page.setRequestInterception(true);

  page.once('request', async (request) => {
    await pageProxy.proxyRequest(request);
  });

  await page.goto('https://example.com');
})();

To skip the proxy, simply call request.continue() conditionally.

With puppeteer-proxy, a Page can have multiple proxies.
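A conditional handler along those lines might look like the sketch below. The names shouldProxy and makeSelectiveHandler and the host list are hypothetical; only pageProxy.proxyRequest() and request.continue() come from the example above.

```javascript
// Hypothetical predicate: proxy only requests to the listed hostnames.
function shouldProxy(requestUrl, proxiedHosts) {
  const { hostname } = new URL(requestUrl);
  return proxiedHosts.includes(hostname);
}

// Hypothetical factory: returns a 'request' event handler that sends
// matching requests through pageProxy and lets all others go out directly.
function makeSelectiveHandler(pageProxy, proxiedHosts) {
  return async (request) => {
    if (shouldProxy(request.url(), proxiedHosts)) {
      await pageProxy.proxyRequest(request);
    } else {
      await request.continue();
    }
  };
}
```

With request interception enabled, the handler would be attached with page.on('request', makeSelectiveHandler(pageProxy, ['example.com'])).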

Gajus
0

You can find a list of proxies on Private Proxy and use it with the code below:

const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async() => {
    // Proxy list from Private Proxy (no leading spaces inside the URL strings)
    const proxiesList = [
        'http://skrll:au4....',
        'http://skrll:au4....',
        'http://skrll:au4....',
        'http://skrll:au4....',
        'http://skrll:au4....',
    ];
    
    const oldProxyUrl = proxiesList[Math.floor(Math.random() * (proxiesList.length))];
    const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
    
    const browser = await puppeteer.launch({
            headless: true,
            ignoreHTTPSErrors: true,
            args: [
              `--proxy-server=${newProxyUrl}`,
              `--ignore-certificate-errors`,
              `--no-sandbox`,
              `--disable-setuid-sandbox`
            ]
     });

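    const page = await browser.newPage();
    // No page.authenticate() call is needed here: the local proxy started by
    // proxy-chain forwards the credentials to the upstream proxy.

    //
    // your code here
    //

    await browser.close();

    // Close the local proxy and force-close any pending connections
    await proxyChain.closeAnonymizedProxy(newProxyUrl, true);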
})();

You can find the full post I wrote here.

LW001
Lionel Lakson
-3

In my experience, all of the above fail for different reasons. I find that applying the proxy to the entire OS works every time; I get no proxy failures. This strategy works on both Windows and Linux.

This way, I get zero Puppeteer bot failures. Bear in mind that I am spinning up 7000 bots per server, and I am running this on 7 servers.

Khalil
  • I think you should improve your question where it says "all above fail...", how often do they fail? the proxy setting is there because it works, be more specific of the situations that yours or others would fail. Also, can you clarify what you mean by "applying proxy on the entire OS"? I'd like to know more, it sounds like you're saying you have 7 services that act as 7 proxies which is a small number of proxies. – Kevin Danikowski Dec 11 '21 at 20:01