
In my web scraper function, I have the following lines of code:

let portList = [9050, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060];
let spoofPort = portList[Math.floor(Math.random()*portList.length)];
console.log("The chosen port was " + spoofPort);

const browser = await puppeteerExtra.launch({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--proxy-server=socks5://127.0.0.1:' + spoofPort
  ]
});

const page = await browser.newPage();

// Note the trailing space after the first fragment, so the two parts do not run together.
const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) ' +
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';

await page.setUserAgent(userAgent);

I'm trying to rotate the IP address for each request (the function that contains this code is called on each request from a client) so that the scraped website doesn't block me so quickly. Instead, I get the error below:

2021-05-17T12:08:19.625349+00:00 app[web.1]: The chosen port was 9050
2021-05-17T12:08:20.042016+00:00 app[web.1]: Error: net::ERR_PROXY_CONNECTION_FAILED at https://expampleDomanPlaceholder.com
2021-05-17T12:08:20.042018+00:00 app[web.1]: at navigate (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
2021-05-17T12:08:20.042018+00:00 app[web.1]: at processTicksAndRejections (internal/process/task_queues.js:93:5)
2021-05-17T12:08:20.042019+00:00 app[web.1]: at async FrameManager.navigateFrame (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
2021-05-17T12:08:20.042020+00:00 app[web.1]: at async Frame.goto (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
2021-05-17T12:08:20.042021+00:00 app[web.1]: at async Page.goto (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:819:16)
2021-05-17T12:08:20.042021+00:00 app[web.1]: at async /app/app.js:174:9
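
`ERR_PROXY_CONNECTION_FAILED` generally means nothing accepted the connection on the chosen proxy port. One way to confirm that before launching the browser is a plain TCP check with Node's built-in `net` module. This is only a minimal sketch: the helper names `proxyArg`, `isPortOpen`, and `chooseLivePort` are hypothetical and not part of Puppeteer, and the check only tells you that something is listening on the port, not that it is actually a working Tor SOCKS proxy.

```javascript
const net = require('net');

// Hypothetical helper: build the Chromium flag used in the launch options above.
function proxyArg(port, host = '127.0.0.1') {
  return '--proxy-server=socks5://' + host + ':' + port;
}

// Hypothetical helper: resolves to true if something accepts TCP connections
// on host:port within timeoutMs, and to false otherwise.
function isPortOpen(port, host = '127.0.0.1', timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (result) => {
      socket.destroy();
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Hypothetical helper: try the ports in random order and return the first
// one that accepts a connection, or null if none of them do.
async function chooseLivePort(ports) {
  const candidates = [...ports];
  while (candidates.length > 0) {
    const index = Math.floor(Math.random() * candidates.length);
    const [port] = candidates.splice(index, 1);
    if (await isPortOpen(port)) {
      return port;
    }
  }
  return null;
}
```

In the request handler, `const spoofPort = await chooseLivePort(portList);` could replace the random pick. A `null` result would suggest that no proxy is listening on any of the ports, which is what the error above points to.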

I've tried the solutions detailed in these posts, but maybe the issue is with my userAgent?:

Getting error when attempting to use proxy server in Node.js / Puppeteer

https://github.com/puppeteer/puppeteer/issues/2472

UPDATE: I tried to use this buildpack (https://github.com/iamashks/heroku-buildpack-tor-proxy.git) but it kept causing my web dyno to break (an 'H14' Error was returned, which means you have to clear the build packs and re-add them). Not sure how to proceed from here as that really seemed to be the only solution I was able to come across.

nickcoding2
  • 1
    The error is right in the log: `net::ERR_PROXY_CONNECTION_FAILED`. It seems that Tor is not configured or not working. – Vaviloff May 17 '21 at 13:07
  • 1
    @Vaviloff For some context, I'm deploying to Heroku and working in a Node.js environment on a Mac. Looking at this link (https://medium.com/@jsilvax/running-puppeteer-with-tor-45cc449e5672), it seems like you're right about me not downloading Tor. But if I'm deploying to Heroku, how do I make sure tor works? Do I install this package or something: https://www.npmjs.com/package/tor-request – nickcoding2 May 17 '21 at 13:14
  • @Vaviloff Do you have any suggestions? – nickcoding2 May 17 '21 at 20:41
  • 1
    I'd suggest that you search for something like [using tor on heroku](https://www.google.com/search?q=using+tor+on+heroku) and then adapt your app accordingly – Vaviloff May 18 '21 at 00:50
  • @Vaviloff So I tried adding the Tor buildpacks from your link to my Heroku app but still wasn't able to get my code working. I also tried a bunch of other "free-proxy" masks but none of them work (these include puppeteer-page-proxy and get-free-https-proxy). Do you know anyone who has deployed Tor to Heroku before who you could put me in contact with? – nickcoding2 May 18 '21 at 12:15
  • Alas, no, I've only learned there are tor buildpacks today :) Maybe get a cheap vps and try to deploy there first? – Vaviloff May 18 '21 at 17:08
  • @Vaviloff have you figured this out yet? You can get this done using a local upstream proxy server that: catches your HTTP requests, allows you to modify them, applies a different proxy to your HTTP request via per request or per per page, modifies the – Quan Truong Jan 21 '22 at 02:08

1 Answer


So there are a few issues.

  1. The error message you posted contains a placeholder instead of the real URL.
  2. The request to that placeholder fails because the domain is misspelled:
     Error: net::ERR_PROXY_CONNECTION_FAILED at https://expampleDomanPlaceholder.com
  3. You have to actually supply the proxy server to the browser object when you launch it, and that proxy must already be initialized.

Here is an example using a proxy server in Cambodia.

We will use a SOCKS4 proxy whose IP address is located in Cambodia.
The proxy IP address is 96.9.77.192 and the port is 55796 (not sure if it still works).


const puppeteer = require('puppeteer');

(async () => {
    let launchOptions = { headless: false, 
                          args: ['--start-maximized',
                                 '--proxy-server=socks4://96.9.77.192:55796'] // this is where we set the proxy
                        };

    const browser = await puppeteer.launch(launchOptions);
    const page = await browser.newPage();

    // set viewport and user agent (just in case for nice viewing)
    await page.setViewport({width: 1366, height: 768});
    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');

    // go to whatismycountry.com to see if proxy works (based on geography location)
    await page.goto('https://whatismycountry.com');

    // close the browser
    await browser.close();
})();

#Proxy Issue
If the proxy host requires authentication, the example below is a better fit.


'use strict';

const puppeteer = require('puppeteer');

(async () => {
  const username = process.env.USER
  const password = process.env.PASS
  const url = 'https://www.google.com'

  const browser = await puppeteer.launch({
    // The proxy host must be correct.
    args: [
      '--proxy-server=socks5://proxyhost:8000',
    ],
  });

  const page = await browser.newPage();

  await page.authenticate({
    username,
    password,
  });

  await page.goto(url);

  await browser.close();
})();

This worked with Tor, using '--proxy-server=socks5://localhost:9050'.
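
The port list in the question only helps if Tor is actually listening on every one of those ports. A stock Tor install normally opens a single SOCKS port, and extra listeners are declared with additional `SocksPort` lines in `torrc`. As far as I know, Tor isolates streams that arrive on different `SocksPort` listeners by default, which is why rotating ports can yield different exit IPs. Below is a rough sketch that renders those lines from the same list. `buildTorrc` is a hypothetical helper, and where the resulting file has to live on Heroku depends on the buildpack, which is not covered here.

```javascript
// Hypothetical helper: render one torrc "SocksPort" line per port so that
// Tor opens a SOCKS listener for each entry in the list.
function buildTorrc(ports, host = '127.0.0.1') {
  return ports.map((port) => `SocksPort ${host}:${port}`).join('\n') + '\n';
}

// Same ports as in the question.
const portList = [9050, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060];
console.log(buildTorrc(portList));
```

With Tor started from a `torrc` containing these lines, each port in the list should accept connections through a flag such as '--proxy-server=socks5://127.0.0.1:9052'.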

References: thanks to @Grant Miller for the TOR testing.

https://dev.to/sonyarianto/practical-puppeteer-using-proxy-to-browse-a-page-1m82

How to make puppeteer work through socks5 proxy?

Josh
  • The placeholder is a Google search result that I didn't want to put on Stack Overflow. I don't think the IP/port combo you provided is up and running anymore, so I looked up a free US-based IP/port combo and found (socks4://98.162.25.23:4145). I tried putting that into the --proxy-server flag, but now I'm getting an 'Error: net::ERR_SSL_PROTOCOL_ERROR' at the .goto() line. I've tried multiple proxies and tried setting ignoreHTTPSErrors: true. I'm wondering, does Google block the socks5 transport protocol? Do you have any suggestions? – nickcoding2 May 28 '21 at 23:39
  • Also, how do I actually npm install Tor (or whatever I have to do)? The documentation on downloading Tor for heroku/Node.js seems very limited... – nickcoding2 May 29 '21 at 00:13