
I'm trying to scrape a web page and save the results to my database. I'm using Node.js (Sails.js framework).

This is a working example using cheerio:

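const cheerio = require('cheerio');

// getRequest is the author's own helper; it resolves with the page HTML.
getRequest('some-url').then((data) => {
    const $ = cheerio.load(data);
    $('.title').each(function (i, element) {
        const title = $(this).text(); // Title
        MyModel.create({title : title}).exec((err, event) => {
            if (err) { console.error(err); }
        });
    });
});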

The problem with cheerio is that it does not act as a browser: it only parses the HTML it is given and does not execute JavaScript, so content rendered client-side is missing from the result.

So I decided to try nightmare js, and it was a nightmare to do the same:

var articles = [];
Promise.resolve(nightmare
    .goto('some-url')
    .wait(0)
    .inject('js', 'assets/js/dependencies/jquery-3.2.1.min.js')
    .evaluate((articles) => {
        var article = {};
        var list = document.querySelectorAll('h3 a');
        var elementArray = [...list];
        elementArray.forEach(el => {
            article.title = el.innerText;
            articles.push(article);
            myModel.create({title : article.title}).exec((err, event) => {
            });
        });
        return articles;
    }, articles)
    .end())
    .then((data) => {
        console.log(data);
    });

The problems

myModel is not defined inside the evaluate() function. The evaluate() function seems to accept only strings, and myModel is a model created by sails.js.

Also, the articles array is populated with the same data.
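The repeated data is consistent with the snippet creating a single article object outside the loop and pushing that same object on every iteration. A minimal sketch of the effect in plain JavaScript, with no browser involved (the texts array and both function names are hypothetical stand-ins, not part of the original code):

```javascript
// Hypothetical stand-in for the innerText of each matched 'h3 a' element.
const texts = ['First', 'Second', 'Third'];

// Mirrors the snippet above: one object is created once, mutated on every
// iteration, and pushed repeatedly, so every array slot is the same object.
function collectShared(items) {
  const article = {};
  const articles = [];
  items.forEach((text) => {
    article.title = text;
    articles.push(article);
  });
  return articles;
}

// Builds a new object per element instead, so each entry keeps its own title.
function collectFresh(items) {
  return items.map((text) => ({ title: text }));
}

console.log(collectShared(texts).map((a) => a.title)); // [ 'Third', 'Third', 'Third' ]
console.log(collectFresh(texts).map((a) => a.title));  // [ 'First', 'Second', 'Third' ]
```

Moving the object creation inside the loop, or using map as in collectFresh, should give each entry its own title.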

Is there any simpler way to scrape a webpage after DOM render using NodeJS?

ponury-kostek
TheUnreal
  • Nightmare works pretty well for scraping. You cannot use Node.js modules inside evaluate() (not simply, at least), but you can return any JSON from evaluate(). In your case, call myModel.create() in the then() function, which has access to Node.js modules. – devilpreet Jun 23 '17 at 12:26
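A rough sketch of what this comment suggests: evaluate() returns plain data only, and the model is touched in then(), back in Node. The function name scrapeAndSave is hypothetical; the Nightmare instance and the Sails model are passed in as parameters, and the 'h3 a' selector is taken from the question. It assumes Model.create() returns a thenable, as Waterline queries do.

```javascript
// Sketch only: `nightmare` is assumed to be a Nightmare instance and `Model`
// a Sails/Waterline model; both are passed in by the caller.
function scrapeAndSave(nightmare, Model, url) {
  return nightmare
    .goto(url)
    .wait('h3 a') // wait until the rendered links exist
    .evaluate(() => {
      // Runs inside the page: only DOM APIs here, no Node modules or models.
      return Array.from(document.querySelectorAll('h3 a'))
        .map((el) => ({ title: el.innerText }));
    })
    .end()
    .then((articles) => {
      // Back in Node: the model is available again.
      return Promise.all(articles.map((a) => Model.create({ title: a.title })))
        .then(() => articles);
    });
}

// Usage inside a Sails app, assuming `npm install nightmare`:
// const Nightmare = require('nightmare');
// scrapeAndSave(Nightmare(), MyModel, 'some-url').then(console.log);
```

Because only serializable values cross the boundary between the page and Node, the array of plain objects comes back intact while the model never has to exist in the page.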

1 Answer


First off, I would ditch the promise chains and use async/await syntax throughout, because it is clearer and easier to work with.

Secondly, yes, there is another option, and depending on what you're trying to do it may be faster and easier to work with.

I am talking about Puppeteer by Google, which controls a headless Chromium browser through a Node.js API, much like Nightmare.js.
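As a minimal sketch of how that might look with async/await: the helper names scrapeTitles and main are hypothetical, the 'h3 a' selector and 'some-url' are taken from the question, and the networkidle2 wait condition is an assumption about when the page has finished rendering.

```javascript
// Sketch only: `browser` is assumed to be a Puppeteer Browser,
// e.g. the result of `await puppeteer.launch()`.
async function scrapeTitles(browser, url, selector = 'h3 a') {
  const page = await browser.newPage();
  try {
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // $$eval runs the callback inside the page and returns its serialized result.
    return await page.$$eval(selector, (links) =>
      links.map((el) => ({ title: el.innerText.trim() })));
  } finally {
    await page.close();
  }
}

async function main() {
  const puppeteer = require('puppeteer'); // assumes `npm install puppeteer`
  const browser = await puppeteer.launch();
  try {
    const articles = await scrapeTitles(browser, 'some-url');
    console.log(articles);
    // Save from Node here, outside the page, e.g. with MyModel.create(...).
  } finally {
    await browser.close();
  }
}

// main().catch(console.error); // uncomment to run against a real page
```

As with Nightmare, the callback passed to $$eval runs in the page, so database calls belong in main(), after the data has been returned.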

I have also written a starting guide for scraping with Puppeteer; I'm sure it will help!

Fabian