I am trying to scrape a Coursera web page using PhantomJS. However, instead of the actual content, the output only shows the "loading" placeholder. When you open Coursera in a regular browser, an intermediate loading screen appears briefly before the real page, and that intermediate screen is what PhantomJS captures. Since PhantomJS is a headless browser, shouldn't it be able to retrieve the page source exactly as a regular browser would? I tried setting timeouts and user agents, but to no avail. Any pointers?

EDIT: Here is the code snippet for simple scraping:

var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
page.settings.resourceTimeout = 5000; // 5 seconds
var url = system.args[1];

page.open(url, function (status) {
    if (status === 'success') {
        var content = page.content;
        console.log(content);
        phantom.exit();
    } else {
        console.log("Error!");
        phantom.exit();
    }
});

EDIT:

I have been trying this a bit more, but still no luck. (Just wondering whether the OP tried this further with any luck.)

var page = require('webpage').create();
page.settings.resourceTimeout = 10000; // 10 seconds
page.settings.userAgent   = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36';
var system = require('system');
var fs = require('fs');

if(system.args.length !== 3) {
  console.log('Usage: phantomjs text-scraper.js <url> <output file>');
  phantom.exit();
}

var url = system.args[1];
var outfile = system.args[2];


// Register the handler before starting the load.
page.onLoadFinished = function (status) {
  var text = page.evaluate(function () {
    return document.title + '\n' + document.body.innerText;
  });
  console.log(text);
  phantom.exit();
};

page.open(url);
Ajay Nair
Trancey
  • Use Greasemonkey or Tampermonkey to scrape; they run all the latest stuff, you can see what you're doing, and it just needs a browser instead of a node box. – dandavis Jan 04 '15 at 17:16
  • I see, I will try that. But those are just for analyzing? – Trancey Jan 04 '15 at 17:33
  • what do you mean by "just for analyzing"? anything you see in the browser you can collect, organize, filter, and save using the "monkeys". to wit; https://github.com/rndme/download can put physical files on the local machine from a string that a monkey makes from html on any url. – dandavis Jan 04 '15 at 17:40
  • Could you show the code you have so far? Thanks. – alecxe Jan 04 '15 at 18:39
  • Please register handlers for the [`onConsoleMessage`](http://phantomjs.org/api/webpage/handler/on-console-message.html), [`onError`](http://phantomjs.org/api/webpage/handler/on-error.html), [`onResourceError`](http://phantomjs.org/api/webpage/handler/on-resource-error.html), and [`onResourceTimeout`](http://phantomjs.org/api/webpage/handler/on-resource-timeout.html) events. Maybe there are errors. Also, don't forget to add your code to the question. – Artjom B. Jan 04 '15 at 21:47
  • None of them registered any errors. I tried all of the above events – Trancey Jan 05 '15 at 11:45
  • possible duplicate of [Page is not completely loaded/rendered when onLoadFinished fires](http://stackoverflow.com/questions/27260702/page-is-not-completely-loaded-rendered-when-onloadfinished-fires) – Artjom B. Jan 06 '15 at 12:19
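
The two suggestions from Artjom B. above (register the diagnostic handlers, and do not trust the first load event on a page that renders itself with JavaScript) might be combined roughly as follows. This is only a sketch, not a confirmed fix for Coursera: `waitFor` is a hypothetical helper written here for illustration, the selector `.rendered-content` is a made-up placeholder for whatever element only exists once the real page has rendered, and the 15 second timeout is an arbitrary guess. The PhantomJS part is guarded so that the helper can also be loaded outside PhantomJS.

```javascript
// Hypothetical polling helper (not part of PhantomJS): calls check() every
// intervalMs until it returns true (then onReady) or until timeoutMs has
// elapsed (then onTimeout).
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}

// Only run the scraping part under PhantomJS, where `phantom` is a global.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();

  // Diagnostic handlers suggested in the comments.
  page.onConsoleMessage = function (msg) {
    console.log('CONSOLE: ' + msg);
  };
  page.onError = function (msg, trace) {
    console.log('PAGE ERROR: ' + msg);
  };
  page.onResourceError = function (resourceError) {
    console.log('RESOURCE ERROR: ' + resourceError.url +
                ' (' + resourceError.errorString + ')');
  };
  page.onResourceTimeout = function (request) {
    console.log('RESOURCE TIMEOUT: ' + request.url);
  };

  page.open(system.args[1], function (status) {
    if (status !== 'success') {
      console.log('Error!');
      phantom.exit(1);
      return;
    }
    waitFor(
      function () {
        // ASSUMPTION: '.rendered-content' is a placeholder selector; replace
        // it with an element that appears only after the loading screen.
        return page.evaluate(function () {
          return document.querySelector('.rendered-content') !== null;
        });
      },
      function () {
        console.log(page.content);
        phantom.exit(0);
      },
      function () {
        console.log('Timed out waiting for the page to render.');
        phantom.exit(1);
      },
      15000, // assumed overall timeout in ms
      250    // polling interval in ms
    );
  });
}
```

Polling for a specific element, rather than sleeping for a fixed time, keeps the script from exiting while the intermediate loading screen is still showing. If the handlers stay silent and the element never appears, the page may depend on browser features that the PhantomJS engine lacks, which this sketch cannot work around.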

0 Answers