23

I am trying to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page rather than the HTML of an incorrect gene page, https://www.ncbi.nlm.nih.gov/gene/119016.

URL fragments are not passed to the server. Instead, the page's JavaScript uses them client-side to (in this case) create entirely different HTML, which is what you get when you go to the page in a browser and "View page source", and which is the HTML I want to retrieve. R's readLines() ignores the part of the URL that follows the #.
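To illustrate the point about fragments, here is a minimal sketch using Node's built-in URL class. The helper name splitFragment is made up for this example; it only separates the part of the URL that goes into the HTTP request from the part that stays in the browser.

```javascript
// Hypothetical helper (not part of any library): split a URL into the
// request target the server sees and the fragment that stays client-side.
function splitFragment(rawUrl) {
  const url = new URL(rawUrl); // URL is a global in Node.js
  return {
    requestTarget: url.pathname + url.search, // sent in the HTTP request line
    fragment: url.hash,                       // never sent to the server
  };
}

const parts = splitFragment('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
console.log(parts.requestTarget); // /gene/?term=AGAP8
console.log(parts.fragment);      // #see-all
```

This is why any tool that only performs the HTTP request, without running the page's scripts, cannot see the effect of #see-all.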

I tried PhantomJS first, but it just returned the error described here (ReferenceError: Can't find variable: Map). This seems to result from PhantomJS not supporting some feature that NCBI uses, which rules out that route.

I had more success with Puppeteer, using the following JavaScript run with Node.js:

const puppeteer = require('puppeteer');
(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
  var HTML = await page.content()
  const fs = require('fs');
  var ws = fs.createWriteStream(
    'TempInterfaceWithChrome.js'
  );
  ws.write(HTML);
  ws.end();
  var ws2 = fs.createWriteStream(
    'finishedFlag'
  );
  ws2.end();
  browser.close();
})();

However, this returned what appeared to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I get in the browser?

Sir_Zorg

6 Answers

13

You can try to change this:

await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

into this:

await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all',
  // current Puppeteer releases accept 'networkidle0' or 'networkidle2' here, not 'networkidle'
  {waitUntil: 'networkidle0'});

Or, you can create a function listenFor() to listen to a custom event on page load:

function listenFor(type) {
  return page.evaluateOnNewDocument(type => {
    document.addEventListener(type, e => {
      window.onCustomEvent({type, detail: e.detail});
    });
  }, type);
}

await listenFor('custom-event-ready'); // Listen for "custom-event-ready" custom event on page load.

Later edit:

This also might come in handy:

await page.waitForSelector('h3'); // replace h3 with your selector
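
The two waits above can also be combined. The following is only a sketch: getRenderedHtml is a name invented for this example, and the selector you pass must be one that exists only in the rendered search results (I have not checked which selector fits the NCBI page). The function takes an already opened Puppeteer page, so it has no dependencies of its own.

```javascript
// Hypothetical helper: navigate, wait for the network to go quiet, wait for
// an element that only exists after client-side rendering, then serialize.
async function getRenderedHtml(page, url, selector, timeoutMs = 30000) {
  // 'networkidle0' resolves once there have been no network connections for 500 ms
  await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
  // Throws if the selector does not appear within timeoutMs
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // page.content() returns the current DOM serialized as HTML
  return page.content();
}
```

Inside the question's async function it could be called as const HTML = await getRenderedHtml(page, 'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', 'h3'); where 'h3' is just a placeholder selector.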
Carol-Theodor Pelu
10

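Maybe try waiting for the navigation to finish (note that waitForNavigation() takes an options object, not a number):

await page.waitForNavigation({ waitUntil: 'networkidle0' });

and then: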

let html = await page.content();
3

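I had success using the following to get HTML content that was generated after the page had loaded.

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(url);
  // Give client-side scripts time to run. The original page.waitFor(2000)
  // was deprecated and later removed from Puppeteer, so use a plain timer.
  await new Promise(resolve => setTimeout(resolve, 2000));
  // Read the innerHTML of the first element matching the selector
  const html_content = await page.$eval('.element-class-name', el => el.innerHTML);
  console.log(html_content);
} catch (err) {
  console.log(err);
} finally {
  await browser.close();
}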

Hope this helps.

Darren Hall
1

Waiting for network idle was not enough in my case, so I waited for the DOMContentLoaded event instead:

await page.goto(url, {waitUntil: 'domcontentloaded', timeout: 60000} );
const data = await page.content();
adlerer
1

Indeed you need innerHTML:

fs.writeFileSync( "test.html", await (await page.$("html")).evaluate( (content => content.innerHTML ) ) );
George Y.
0

If you want to actually await a custom event, you can do it this way.

const page = await browser.newPage();

/**
  * Attach an event listener to page to capture a custom event on page load/navigation.
  * @param {string} type Event name.
  * @return {!Promise}
  */
function addListener(type) {
  return page.evaluateOnNewDocument(type => {
    // here we are in the browser context
    document.addEventListener(type, e => {
      window.onCustomEvent({ type, detail: e.detail });
    });
  }, type);
}

const evt = await new Promise(async resolve => {
  // Define a window.onCustomEvent function on the page.
  await page.exposeFunction('onCustomEvent', e => {
    // here we are in the node context
    resolve(e); // resolve the outer Promise here so we can await it outside
  });

  await addListener('app-ready'); // setup listener for "app-ready" custom event on page load
  await page.goto('http://example.com');  // N.B! Do not use { waitUntil: 'networkidle0' } as that may cause a race condition
});

console.log(`${evt.type} fired`, evt.detail || '');

Built upon the example at https://github.com/GoogleChrome/puppeteer/blob/master/examples/custom-event.js

mflodin