1

I'm kind of starting in Javascript and I need help to figure out how can I make this code synchronous while looping through for loop. Basically what I'm doing is making multiple POST requests inside for loops and then im scrapping the data using the library X-Ray and finally I'm saving the result to a Mongo Database. The output is ok but it comes in unordered way and suddenly hangs and I have to force close using ctrl+C. This is my function:

  function getdata() {
  const startYear = 1996;
  const currentYear = 1998; // new Date().getFullYear()

for (let i = startYear; i <= currentYear; i++) {
for (let j = 1; j <= 12; j++) {
  if (i === startYear) {
    j = 12;
  }

  // Form to be sent
  const form = {
    year: `${i}`,
    month: `${j}`,
    day: '01',
  };

  const formData = querystring.stringify(form);
  const contentLength = formData.length;

  // Make HTTP Request
  request({
    headers: {
      'Content-Length': contentLength,
      'Content-Type': 'application/x-www-form-urlencoded',
    },
    uri: 'https://www.ipma.pt/pt/geofisica/sismologia/',
    body: formData,
    method: 'POST',
  }, (err, res, html) => {

    if (!err && res.statusCode === 200) {

      // Scrapping data with X-Ray
      x(html, '#divID0 > table > tr', {
        date: '.block90w',
        lat: 'td:nth-child(2)',
        lon: 'td:nth-child(3)',
        prof: 'td:nth-child(4)',
        mag: 'td:nth-child(5)',
        local: 'td:nth-child(6)',
        degree: 'td:nth-child(7)',
      })((error, obj) => {

        const result = {
          date: obj.date,
          lat: obj.lat.replace(',', '.'),
          lon: obj.lon.replace(',', '.'),
          prof: obj.prof == '-' ? null : obj.prof.replace(',', '.'),
          mag: obj.mag.replace(',', '.'),
          local: obj.local,
          degree: obj.degree,
        };

        // console.log(result);

        upsertEarthquake(result); // save to DB

      });

    }


  });

  }
  }
  }

I guess I have to use promises or callbacks but I can't understand how to do this, and I already tried using async await but with no success. If any additional info needs to be provided please tell me, thanks.

eLRuLL
  • 18,488
  • 9
  • 73
  • 99
Th3Guy
  • 21
  • 1
  • 8

3 Answers3

1

You are calling the request inside a loop.

Async functions are functions where getting the result (A.K.A., receiving a response in the callback function) is made after the main thread logic has ended.

This way, if we have this:

for (var i = 0; i < 12; i++) {
    request({
        data: i
    }, function(error, data) {
        // This is the request result, inside a callback function
    });
}

The logic will be to run over the 12 requests before calling the callbacks, so the callbacks will be stacked and called after all the main loop is run.

Without entering all the ES6 generators thing (As I think that it makes it a bit more complicated and also learning at low-level what is happening is better for you), what you have to do is to call the request, wait for his callback function to be called and call the next request. How to do that? There's a bunch of ways, but I usually go like this:

var i= 0;
function callNext() {
    if (i>= 12) {
        requestEnded();
    } else {
        request({
            data: i++ // Increment the counter as we are not inside a for loop that increments it
        }, function(error, data) {
            // Do something with the data, and also check if an error was received and act accordingly, which is very much possible when talking about internet requests
            console.log(error, data);
            // Call the next request inside the callback, so we are sure that the next request is ran just after this request has ended
            callNext();
        })
    }
}
callNext();

requestEnded() {
    console.log("Yay");
}

Here you see the logic. You have a function named callNext which will make the next call or call requestEnded if no more calls are needed.

When request is called inside callNext, it will wait for the callback to be received (which will happen asynchronously, sometime in the future), will process the data received and then inside the callback you tell him to call again callNext.

Jorge Fuentes González
  • 11,568
  • 4
  • 44
  • 64
  • I tried to implement that but the output still comes unordered, despite it does not hang anymore. You can see here what I did https://pastebin.com/J0PdG9rv – Th3Guy Dec 26 '17 at 22:06
  • @MiguelFerreira You got it. With this info you created a way to concatenate async callbacks and that's cool, but I still see a problem. You have an async request inside another async request. Inside the `request` callback you call `x`, which also has an async callback. You want the next request to be done after ALL the async callbacks are done. I've tweaked your pastebin a bit: https://pastebin.com/keshVjXx Check where I call the `getdata`, just after you got your desired result. At this point you are done and can continue with the next request. – Jorge Fuentes González Dec 26 '17 at 22:12
  • Notice how if you get an error or something that makes you don't call the `x` function, it won't continue with the next request. Think if this is something you actually want. Maybe you want to repeat the request, or start over again from start if this situation happens. That's up to you. Just add an `else` and add the code you want to run if the "request chain" breaks. – Jorge Fuentes González Dec 26 '17 at 22:13
  • Thanks jorge for the time and effort, your solution was simple and worked perfectly +1 for you! Edit: I cant upvote because I still have <15 of reputation but I will certainly do It when I achieve that mark, thanks again. – Th3Guy Dec 26 '17 at 23:40
  • @MiguelFerreira haha, no problem. Glad it worked :-) – Jorge Fuentes González Dec 26 '17 at 23:58
-1

Instead of looping you could create an array using start and end year then map that to config for your request, then map the result of that to what x-ray returns (x-ray returns a promise like so no need for callback). Then put the result of the scrape in mongodb using a function that returns a promise.

If something rejects then create a Fail type object and resolve with that object.

Start all requests, x-ray and mongo in parallel using Promise.all but limit the amount of active requests using throttle.

Here is what that would look like in code:

//you can get library containing throttle here:
//  https://github.com/amsterdamharu/lib/blob/master/src/index.js
const lib = require('lib');
const Fail = function(details){this.details=details;};
const isFail = o=>(o&&o.constructor)===Fail;
const max10 = lib.throttle(10);
const range = lib.range;
const createYearMonth = (startYear,endYear)=>
  range(startYear,endYear)
  .reduce(
    (acc,year)=>
      acc.concat(
        range(1,12).map(month=>({year,month}))
      )
    ,[]
  );
const toRequestConfigs = yearMonths =>
  yearMonths.map(
    yearMonth=>{
      const formData = querystring.stringify(yearMonth);
      return {
        headers: {
          'Content-Length': formData.length,
          'Content-Type': 'application/x-www-form-urlencoded',
        },
        uri: 'https://www.ipma.pt/pt/geofisica/sismologia/',
        body: formData,
        method: 'POST',
      };
    }
  );
const scrape = html =>
  x(
    html, 
    '#divID0 > table > tr', 
    {
      date: '.block90w',
      lat: 'td:nth-child(2)',
      lon: 'td:nth-child(3)',
      prof: 'td:nth-child(4)',
      mag: 'td:nth-child(5)',
      local: 'td:nth-child(6)',
      degree: 'td:nth-child(7)'
    }
  );
const requestAsPromise = config =>
  new Promise(
    (resolve,reject)=>
      request(
        config,
        (err,res,html)=>
          (!err && res.statusCode === 200) 
            //x-ray returns a promise:
            // https://github.com/matthewmueller/x-ray#xraythencb
            ? resolve(html)
            : reject(err)
      )
  );
const someMongoStuff = scrapeResult =>
  //do mongo stuff and return promise
  scrapeResult;
const getData = (startYear,endYear) =>
  Promise.all(
    toRequestConfigs(
      createYearMonth(startYear,endYear)
    )
    .map(
      config=>
        //maximum 10 active requests
        max10(requestAsPromise)(config)
        .then(scrape)
        .then(someMongoStuff)
        .catch(//if something goes wrong create a Fail type object
          err => new Fail([err,config.body])
        )
    )
  )
//how to use:
getData(1980,1982)
.then(//will always resolve unless toRequestConfigs or createYearMonth throws
  result=>{
    //items that were successfull
    const successes = result.filter(item=>!isFail(item));
    //items that failed
    const failed = result.filter(isFail);
  }
)

What happens a lot with scraping is that the target site does not allow you to make more than x requests for a y period and start blacklisting your IP and refusing service if you go over that.

Let's say you want to limit to 10 requests per 5 seconds then you can change above code to:

const max10 = lib.throttlePeriod(10,5000);

The rest of the code is the same

HMR
  • 37,593
  • 24
  • 91
  • 160
  • 1
    Thanks a lot for the time and effort but jorge was faster to answer and is solution was simpler. – Th3Guy Dec 26 '17 at 23:42
  • @MiguelFerreira No problem, let me know if you need help implementing this solution because you understand how throttled parallel relates to scraping or if you loose all successfully processed items because one failed or if you understand the difference between imperative and declarative programming. – HMR Dec 27 '17 at 01:41
-1

You have the sync for...loop with async methods inside it problem.

A clean way to fix this is with

ES2017 async/await syntax

Assuming that you want to stop every iteration after upsertEarthquake(result) you should change your code for something like this.

function async getdata() {
    const startYear = 1996;
    const currentYear = 1998; // new Date().getFullYear()

    for (let i = startYear; i <= currentYear; i++) {
        for (let j = 1; j <= 12; j++) {
            if (i === startYear)
                j = 12; 

            // Form to be sent
            const form = {
                year: `${i}`,
                month: `${j}`,
                day: '01',
            };

            const formData = querystring.stringify(form);
            const contentLength = formData.length;
            //Make HTTP Request
            await new Promise((next, reject)=> { 
                request({
                    headers: {
                        'Content-Length': contentLength,
                        'Content-Type': 'application/x-www-form-urlencoded',
                    },
                    uri: 'https://www.ipma.pt/pt/geofisica/sismologia/',
                    body: formData,
                    method: 'POST',
                }, (err, res, html) => {
                    if (err || res.statusCode !== 200)
                        return next() //If there is an error jump to the next

                    //Scrapping data with X-Ray
                    x(html, '#divID0 > table > tr', {
                        date: '.block90w',
                        lat: 'td:nth-child(2)',
                        lon: 'td:nth-child(3)',
                        prof: 'td:nth-child(4)',
                        mag: 'td:nth-child(5)',
                        local: 'td:nth-child(6)',
                        degree: 'td:nth-child(7)',
                    })((error, obj) => {
                        const result = {
                            date: obj.date,
                            lat: obj.lat.replace(',', '.'),
                            lon: obj.lon.replace(',', '.'),
                            prof: obj.prof == '-' ? null : obj.prof.replace(',', '.'),
                            mag: obj.mag.replace(',', '.'),
                            local: obj.local,
                            degree: obj.degree,
                        }
                        //console.log(result);
                        upsertEarthquake(result); // save to DB
                        next() //This makes jump to the next for... iteration
                    })

                }) 
            }
        }
    }
}

I assume that upsertEarthquake is an asynchronous function or is the type fire and forget.

If there is an error you can use the next(), but if you want to break the loop, use reject()

if (err || res.statusCode !== 200)
    return reject(err)
Fernando Carvajal
  • 1,869
  • 20
  • 19