
I tried taking a proxy from this site: https://hidemy.name/en/proxy-list/?type=4#list

Here is my Puppeteer scraping code (deployed to Heroku), which is returning the error in the title on the .goto() line:

const preparePageForTests = async (page) => {

  const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36';

  await page.setUserAgent(userAgent);

  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
  });

  // Pass the Chrome Test.
  await page.evaluateOnNewDocument(() => {
    // We can mock this in as much depth as we need for the test.
    window.navigator.chrome = {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      }
    };
  });

  await page.evaluateOnNewDocument(() => {
    const originalQuery = window.navigator.permissions.query;
    return window.navigator.permissions.query = (parameters) => (
      parameters.name === 'notifications' ?
        Promise.resolve({ state: Notification.permission }) :
        originalQuery(parameters)
    );
  });

  await page.evaluateOnNewDocument(() => {
    // Overwrite the `plugins` property to use a custom getter.
    Object.defineProperty(navigator, 'plugins', {
      // This just needs to have `length > 0` for the current test,
      // but we could mock the plugins too if necessary.
      get: () => [1, 2, 3, 4, 5],
    });
  });

  await page.evaluateOnNewDocument(() => {
    // Overwrite the `languages` property to use a custom getter.
    Object.defineProperty(navigator, 'languages', {
      get: () => ['en-US', 'en'],
    });
  });
}

const browser = await puppeteerExtra.launch({
  headless: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox', '--proxy-server=socks4://109.94.182.128:4145'],
});

const page = await browser.newPage();

await preparePageForTests(page);

await page.goto('https://www.google.com/search?q=concerts+near+new+york&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail#htivrt=events&htidocid=L2F1dGhvcml0eS9ob3Jpem9uL2NsdXN0ZXJlZF9ldmVudC8yMDIxLTA2LTA0fDIxMjMzMzg4NTU2Nzc1NDk%3D&fpstate=tldetail') 

I also sometimes get an "ERR_CONNECTION_CLOSED" or "ERR_CONNECTION_FAILED" instead of ERR_CONNECTION_RESET.

Any help in getting rid of this error (presumably by adding more ways to pass the google tests in the preparePageForTests function) would be amazing, thank you!

nickcoding2

2 Answers


You're using low-quality public proxies, so it's only natural that they generate network errors or get blocked by Google. The simplest solution here is to switch to paid proxies.

But it's also possible to catch the error and repeat the request if page.goto fails:

const collectData = async (page) => {
  try {
    await page.goto('https://www.google.com/search?q=concerts+near+new+york');
    return page.evaluate(() => document.title);
  } catch (err) {
    console.error(err.message);
    return false;
  }
}

let data = false;
let attempts = 0;

// Retry request until it gets data or tries 5 times
while (data === false && attempts < 5) {
  data = await collectData(page);
  attempts += 1;
  if (data === false) {
    // Wait a few seconds, also a good idea to swap proxy here*
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}
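
The loop above keeps retrying through the same proxy. As a rough sketch of the "swap proxy" idea, assuming you already have a list of proxy URLs: the helper below cycles through that list. The names `retryAcrossProxies` and `scrapeThroughProxy` are hypothetical, not part of Puppeteer. Chromium's `--proxy-server` flag is fixed at launch, so this sketch starts a fresh browser for every attempt.

```javascript
// Hypothetical helper: call `attempt(proxy)` with the next proxy in the list
// until it resolves to something other than false, or attempts run out.
async function retryAcrossProxies(proxies, attempt, { maxAttempts = 5, delayMs = 3000 } = {}) {
  if (proxies.length === 0) return false;
  for (let i = 0; i < maxAttempts; i += 1) {
    // Wrap around when there are fewer proxies than attempts.
    const proxy = proxies[i % proxies.length];
    try {
      const data = await attempt(proxy);
      if (data !== false) return data;
    } catch (err) {
      // Network errors such as ERR_CONNECTION_RESET end up here.
      console.error(`${proxy}: ${err.message}`);
    }
    // Pause before the next attempt, but not after the last one.
    if (i < maxAttempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example attempt (assumption: `puppeteer` is the module you already launch with).
async function scrapeThroughProxy(puppeteer, proxy) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox', `--proxy-server=${proxy}`],
  });
  try {
    const page = await browser.newPage();
    await page.goto('https://www.google.com/search?q=concerts+near+new+york');
    return await page.evaluate(() => document.title);
  } finally {
    // Always release the browser, even when navigation throws.
    await browser.close();
  }
}
```

It could be called as `retryAcrossProxies(proxyList, (proxy) => scrapeThroughProxy(puppeteer, proxy))`, where `proxyList` is your own array of proxy URLs.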


* Modules for changing proxies programmatically:

Vaviloff
  • I do really like this concept but haven't implemented it yet--to start, I've installed proxy-chain though and then using a high quality paid proxy, I tried using Zilvia Smith's solution here: https://stackoverflow.com/questions/49376910/unable-to-use-proxy-with-puppeteer-error-err-no-supported-proxies-gets-thrown, but what's weird is that I get a timeout error when 'await page.waitForSelector("ul", { timeout: 30000 })' is called. Not sure why because if I don't use the proxy waitForSelector works as intended. I've tried multiple proxies. Do you have any suggestions? – nickcoding2 Jun 08 '21 at 22:04
  • You're probably getting the timeout either because the page hasn't loaded in 30 seconds or because it doesn't have that list. Have you checked that the page content is actually what you think it should be? With `headless: false` mode or with screenshots? Depending on the proxy you could be shown an interstitial, a modal, a redirect or just a block page. – Vaviloff Jun 09 '21 at 05:51
  • I'd suggest first practicing with your own proxy, which should work 100% of the time (install something like [3proxy](https://github.com/3proxy/3proxy)), and visiting something simple like example.com; *then* going for the scrape target. – Vaviloff Jun 09 '21 at 05:54
  • Also, ensure proxy-chain is set up correctly: try using it in your browser with something like [SwitchyOmega](https://chrome.google.com/webstore/detail/proxy-switchyomega/padekgcemlokbadohgkifijomclgjgif) – Vaviloff Jun 09 '21 at 05:56
  • I think I'm capturing screenshots in Puppeteer with the method detailed here (https://bitsofco.de/using-a-headless-browser-to-capture-page-screenshots/) but when I downloaded my code from Heroku using the information found here (https://help.heroku.com/FZDDCBLB/how-can-i-download-my-code-from-heroku), the screenshot isn't in those files. Do you have any suggestions for what I can do to take the screenshots? I was thinking run my code locally using node app.js but I don't think that'll work because I have databases and other elements at play that are linked directly to Heroku. – nickcoding2 Jun 09 '21 at 11:43
  • It *is* possible to capture a screenshot as a blob and send it somewhere with a POST request... But really you should just make a simple minimal local app to figure things out with proxies. – Vaviloff Jun 09 '21 at 12:46
  • I'm taking the screenshot and I'm getting an error that says: '403. That's an error. Your client does not have permission to get URL/.......... from this server. That's all we know.' So I assume Google is detecting my scraping--do you have any other suggestions? – nickcoding2 Jun 09 '21 at 13:59
  • Well, that's another issue which deserves its own question, but I think we've resolved this one about proxy network error, wouldn't you say? – Vaviloff Jun 09 '21 at 16:00
  • You're correct, thank you--I'll post another question once I figure out how to phrase it all correctly! – nickcoding2 Jun 09 '21 at 16:11
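
The comments above suggest capturing a screenshot and sending it somewhere with a POST request, which helps when files written on Heroku are hard to retrieve. A minimal sketch of that idea follows. It assumes a hypothetical `uploadUrl` endpoint that you control and that accepts JSON; `uploadScreenshot` is an illustrative name, and `post` defaults to the global `fetch` available in Node 18+.

```javascript
// Capture the current page as a base64-encoded PNG and POST it as JSON.
// `uploadUrl` is a hypothetical endpoint; replace it with one you control.
async function uploadScreenshot(page, uploadUrl, post = fetch) {
  // With encoding: 'base64', Puppeteer resolves to a string instead of a Buffer.
  const image = await page.screenshot({ encoding: 'base64', fullPage: true });
  const response = await post(uploadUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // Include the final URL so redirects to a block page are visible.
    body: JSON.stringify({ url: page.url(), image }),
  });
  return response.ok;
}
```

Calling `await uploadScreenshot(page, uploadUrl)` right after `page.goto` (or inside the `catch` block) would show whether the proxy served the real page, an interstitial, or a block page.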

You need to await the page.goto("...") call:

await page.goto("https://google.com", {waitUntil: "networkidle2"});
Dan Mullin
  • This doesn't work--the await was in my code but I accidentally removed it from the question--also, the waitUntil causes a timeoutError to occur because I've deployed to Heroku which hits a timeout after 30 seconds. I've also updated my code with the actual URL of the google search (if that helps). – nickcoding2 Jun 04 '21 at 13:05
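
Given the 30-second Heroku limit mentioned in the comment above, one option is to give `page.goto` an explicit timeout below that limit and wait for a lighter event than `networkidle2`. The sketch below is only an illustration: `gotoWithinBudget` and the 20-second default are assumptions, while `waitUntil` and `timeout` are standard `page.goto` options.

```javascript
// Navigate with a time budget and report the outcome instead of throwing.
async function gotoWithinBudget(page, url, budgetMs = 20000) {
  try {
    const response = await page.goto(url, {
      // Resolves once the HTML is parsed, without waiting for the network to go quiet.
      waitUntil: 'domcontentloaded',
      timeout: budgetMs,
    });
    // page.goto can resolve to null (for example on same-page hash navigation).
    return { ok: true, status: response ? response.status() : null };
  } catch (err) {
    // Timeouts and proxy errors such as ERR_CONNECTION_RESET land here.
    return { ok: false, error: err.message };
  }
}
```

The returned `status` also makes a block page easier to spot, since a 403 from Google resolves normally rather than throwing.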