
When I access a DOI URL, it is redirected to the following URL:

https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715

But that is not the final URL, which is https://www.sciencedirect.com/science/article/pii/S1550413115002715?via%3Dihub

$ wget --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36' https://doi.org/10.1016/j.cmet.2015.06.004
$ grep Redirect j.cmet.2015.06.004.html |grep meta
<meta HTTP-EQUIV="REFRESH" content="2; url='/retrieve/articleSelectPrefsPerm?Redirect=https%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1550413115002715%3Fvia%253Dihub&amp;key=f0d7d908599d0c4f0ee467d0e225836b1927eb91'"/>
$ wget -S -o /dev/stderr --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36' https://doi.org/10.1016/j.cmet.2015.06.004 > /dev/null
--2019-08-08 06:01:13--  https://doi.org/10.1016/j.cmet.2015.06.004
Resolving doi.org (doi.org)... 104.26.9.237, 104.26.8.237, 2606:4700:20::681a:8ed, ...
Connecting to doi.org (doi.org)|104.26.9.237|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 302 
  Date: Thu, 08 Aug 2019 11:01:14 GMT
  Content-Type: text/html;charset=utf-8
  Content-Length: 209
  Connection: keep-alive
  Set-Cookie: __cfduid=d1dd9844bf9c103fcc56abf104a78957b1565262073; expires=Fri, 07-Aug-20 11:01:13 GMT; path=/; domain=.doi.org; HttpOnly
  Vary: Accept
  Location: https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715
  Expires: Thu, 08 Aug 2019 11:27:57 GMT
  Link: <https://dul.usage.elsevier.com/doi/>; rel=dul
  Strict-Transport-Security: max-age=86400; includeSubDomains
  Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
  Server: cloudflare
  CF-RAY: 5030fdb9c8dde04d-DFW
Location: https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715 [following]
--2019-08-08 06:01:14--  https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715
Resolving linkinghub.elsevier.com (linkinghub.elsevier.com)... 18.204.111.22, 34.198.26.18
Connecting to linkinghub.elsevier.com (linkinghub.elsevier.com)|18.204.111.22|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 
  Date: Thu, 08 Aug 2019 11:01:14 GMT
  Content-Type: text/html;charset=UTF-8
  Content-Length: 8144
  Connection: keep-alive
  Set-Cookie: JSESSIONID=9EB99263F2DD8482804BE74C0DDBAE51; Path=/retrieve; Secure; HttpOnly
  Pragma: no-cache
  Cache-Control: no-cache, no-store, must-revalidate
  Expires: Thu, 01 Jan 1970 00:00:00 GMT
  Set-Cookie: visitorId=vOzKJBQjOR53unZLGF8y; Max-Age=2147483647; Expires=Tue, 26-Aug-2087 14:15:21 GMT; Path=/
  P3P: CP="NON DSP COR CUR ADM DEV TAI PSA PSD OUR IND UNI NAV STA PRE COM INT CNT",policyref="https://linkinghub.elsevier.com/retrieve/static/P3P/IHUB-p3p.xml"
  Content-Language: en-US
Length: 8144 (8.0K) [text/html]
Saving to: ‘j.cmet.2015.06.004’

     0K .......                                               100%  123M=0s

2019-08-08 06:01:14 (123 MB/s) - ‘j.cmet.2015.06.004’ saved [8144/8144]
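The log above shows why wget stops at linkinghub.elsevier.com: that page answers with HTTP 200, and the next hop is an HTML meta refresh rather than a Location header. As a minimal sketch, the refresh target can be pulled out of the saved HTML like this. The function names parseMetaRefresh and redirectParam are hypothetical, and treating the Redirect query parameter as the final article URL is an assumption based only on the one tag shown above.

```javascript
// Extract the delay and target URL from an HTML meta refresh tag.
// `html` is the saved page body; `baseUrl` is the URL it was fetched from.
function parseMetaRefresh(html, baseUrl) {
  // Locate a <meta http-equiv="refresh" ...> tag (attribute names are case-insensitive).
  const tag = html.match(/<meta[^>]*http-equiv=["']?refresh["']?[^>]*>/i);
  if (!tag) return null;
  const content = tag[0].match(/content="([^"]*)"/i);
  if (!content) return null;
  // The content attribute looks like: 2; url='/path?query'
  const parts = content[1].match(/^\s*(\d+)\s*;\s*url\s*=\s*['"]?([^'"]*)['"]?\s*$/i);
  if (!parts) return null;
  const target = parts[2].replace(/&amp;/g, '&'); // undo HTML entity escaping
  return { delaySeconds: Number(parts[1]), url: new URL(target, baseUrl).href };
}

// Assumption: the site passes the final destination in a `Redirect` query parameter.
function redirectParam(refreshUrl) {
  return new URL(refreshUrl).searchParams.get('Redirect');
}
```

Calling parseMetaRefresh on the saved file with the linkinghub URL as baseUrl gives the intermediate /retrieve/articleSelectPrefsPerm URL and a delay of 2 seconds. Requesting the decoded Redirect value directly may not behave the same as following the refresh, since the intermediate page may set cookies.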

I tried the following Puppeteer code to handle the redirect automatically, but it fails. Does anybody know how to make it follow the redirects to the final page automatically?

$ cat puptr2cntnt.js 
#!/usr/bin/env node
// vim: set noexpandtab tabstop=2:

const puppeteer = require('puppeteer');
const fs = require('fs');

const url = process.argv[2];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    const content = await page.content();
    console.log(content);
    await browser.close();
})();
$ ./puptr2cntnt.js  https://doi.org/10.1016/j.cmet.2015.06.004
(node:73532) UnhandledPromiseRejectionWarning: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:161:15)
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
    at async ExecutionContext.evaluateHandle (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:119:56)
    at async ExecutionContext.evaluate (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:48:20)
    at async DOMWorld.content (/usr/local/lib/node_modules/puppeteer/lib/DOMWorld.js:185:12)
    at async Page.content (/usr/local/lib/node_modules/puppeteer/lib/Page.js:612:12)
    at async /Users/pengy/linux/bin/wrappercomposite/src/xplat/puptrxplat/src/puptr2cntnt/node/default/puptr2cntnt.js:13:18
  -- ASYNC --
    at ExecutionContext.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:110:27)
    at ExecutionContext.evaluate (/usr/local/lib/node_modules/puppeteer/lib/ExecutionContext.js:48:31)
    at ExecutionContext.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:111:23)
    at DOMWorld.evaluate (/usr/local/lib/node_modules/puppeteer/lib/DOMWorld.js:112:20)
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
    at async DOMWorld.content (/usr/local/lib/node_modules/puppeteer/lib/DOMWorld.js:185:12)
  -- ASYNC --
    at Frame.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:110:27)
    at Page.content (/usr/local/lib/node_modules/puppeteer/lib/Page.js:612:49)
    at Page.<anonymous> (/usr/local/lib/node_modules/puppeteer/lib/helper.js:111:23)
    at /Users/pengy/linux/bin/wrappercomposite/src/xplat/puptrxplat/src/puptr2cntnt/node/default/puptr2cntnt.js:13:29
    at processTicksAndRejections (internal/process/task_queues.js:89:5)
(node:73532) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:73532) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
O. Jones
user1424739

2 Answers


You have to wait for the navigation to finish. You can call the waitForNavigation method after the goto method:

await page.waitForNavigation({waitUntil: 'domcontentloaded'});

Or just pass {waitUntil: 'domcontentloaded'} as the second argument to the goto method:

await page.goto(url, {waitUntil: 'domcontentloaded'});

Full script:

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'domcontentloaded'});
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

The waitUntil option controls when navigation is considered to have succeeded; it defaults to load. Given an array of event strings, navigation is considered successful after all events have fired. Events can be any of:

  • load - consider navigation to be finished when the load event is fired.
  • domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired.
  • networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
  • networkidle2 - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.

Read more in the Puppeteer docs for page.goto.

Yevhen Laichenkov
  • The webpage saved is still not the same as what I see in Chrome. I suspect that it is because some javascript code is run. Is there a way to wait until everything is loaded? https://stackoverflow.com/questions/53992509/not-all-dom-content-is-ready-after-waituntil-domcontentloaded – user1424739 Aug 08 '19 at 12:07
  • Hey @user1424739. I have edited the answer, added more events for waiting. – Yevhen Laichenkov Aug 08 '19 at 12:08
  • I tried `networkidle0`, but it still does not show the whole webpage. I think the problem is that the webpage pauses for a brief period and then loads more, so `networkidle0` considers the page fully loaded and returns too early. How can I resolve this issue? – user1424739 Aug 08 '19 at 14:34
  • I would suggest adding the `waitForSelector` method to wait for the `selector` to be visible on the page. For example: `await page.waitForSelector('.article__sections', {visible: true})` – Yevhen Laichenkov Aug 09 '19 at 07:04
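Putting the suggestions from this thread together, one possible shape is to navigate with networkidle0 and then wait for a selector that only exists on the final article page before reading the DOM. This is only a sketch: fetchRenderedHtml is a hypothetical helper, and the '.article__sections' selector comes from the comment above and has not been checked against the live site.

```javascript
// Navigate, wait for the network to go quiet, then wait for a content
// selector before reading the DOM.
// `page` is a Puppeteer Page; `selector` is an assumption about the target markup.
async function fetchRenderedHtml(page, url, selector, timeoutMs = 30000) {
  await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
  // waitForSelector keeps polling across navigations, so it can outlast
  // an intermediate redirect page that never contains the selector.
  await page.waitForSelector(selector, { visible: true, timeout: timeoutMs });
  return page.content();
}
```

It would be called as `const html = await fetchRenderedHtml(page, url, '.article__sections');` between newPage() and browser.close(). If the selector never appears, waitForSelector rejects after timeoutMs, so the caller should be ready to catch that.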

You will have to call page.waitForNavigation multiple times, as in your case the website redirects to a page, which waits some time before redirecting to another page. To automate this, you can use this function:

async function waitForMoreNavigation(page) {
  try {
    while (true) {
      await page.waitForNavigation({ timeout: 2000 });
    }
  } catch (err) {} // a timeout was thrown: no more navigations, so stop waiting
}

The function keeps waiting for more navigation inside a loop until there are no more navigation events happening and the timeout hits. Keep in mind that this will wait for at least two seconds before progressing. Depending on your task you might want to change the value for timeout.

Code Sample

Use the function after your page.goto call like this:

await page.goto('https://linkinghub.elsevier.com/retrieve/pii/S1550413115002715');
await waitForMoreNavigation(page);
console.log(page.url());
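A slightly more defensive variant of the same loop is sketched below. It stops only on timeouts, rethrows other errors, and caps the number of hops so a redirect loop cannot run forever. The name waitForRedirectsToSettle and the quietMs and maxHops options are hypothetical, and matching on err.name === 'TimeoutError' is an assumption about how Puppeteer labels its timeout errors. The default quiet period is set above the 2-second delay in the meta refresh tag from the question, so the wait does not give up just before the refresh fires.

```javascript
// Wait for a chain of client-side redirects to finish.
// Resolves with the number of navigations observed after the call.
async function waitForRedirectsToSettle(page, { quietMs = 3000, maxHops = 10 } = {}) {
  let hops = 0;
  while (hops < maxHops) {
    try {
      await page.waitForNavigation({ timeout: quietMs });
      hops += 1; // another redirect happened, keep waiting
    } catch (err) {
      // Assumption: Puppeteer's timeout errors carry the name 'TimeoutError'.
      if (err && err.name === 'TimeoutError') break; // quiet period elapsed
      throw err; // anything else is a real failure
    }
  }
  return hops;
}
```

Used in place of waitForMoreNavigation, it would be `await waitForRedirectsToSettle(page);` followed by `console.log(page.url());`. The trade-off is the same as in the original function: the call always costs at least one full quiet period.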
Thomas Dondorf