
I'm trying to access all repositories that have more than 5000 stars on GitHub. I've written this scraper in Node.js (it runs in a Cloud9 environment):

var request = require('request');
var fs = require('fs');

var options = {
  url: 'https://api.github.com/repositories',
  headers: {
    'User-Agent': 'myusernamehere'
  },
  qs: {
    stargazers: 5000
  }
};

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(response.headers);

    fs.writeFile('output_teste.json', body, function (err) {
      if (err) throw err;
      console.log('It\'s saved!');
      console.log(response.statusCode);
    });

  } else {
    // response is undefined when the request itself fails, so log the error in that case
    console.log(error || response.statusCode);
  }
}

request(options, callback);

But the result contains only the first page of repositories, not all of them. How can I use pagination with the Request module? I've looked for examples in the documentation, but they aren't very clear. Or do I need another library, or maybe another language?

Thanks!

TessavWalstijn

2 Answers


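You should add the "since" parameter to your query string. This endpoint returns only repositories with an ID greater than the value of "since", so passing the ID of the last repository you received fetches the next page. You can read more in the GitHub documentation: https://developer.github.com/v3/repos/#list-all-public-repositories

Sample URL with the "since" query string: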

https://api.github.com/repositories?since=364
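
As a rough sketch of how the next URL could be built from the page you just received: the helper name nextPageUrl is made up for illustration, and it assumes each repository object in the parsed body has a numeric id and that an empty page means there is nothing left to fetch.

```javascript
// Hypothetical helper (not part of Request or the GitHub API):
// build the URL for the next page from the page just received.
// Assumes every repository object carries a numeric `id`.
function nextPageUrl(baseUrl, repos) {
  if (!Array.isArray(repos) || repos.length === 0) {
    return null; // empty page: nothing left to fetch
  }
  var lastId = repos[repos.length - 1].id;
  return baseUrl + '?since=' + encodeURIComponent(lastId);
}

// Made-up data; a real page would come from JSON.parse(body).
var page = [{ id: 1, name: 'a' }, { id: 364, name: 'b' }];
console.log(nextPageUrl('https://api.github.com/repositories', page));
// prints "https://api.github.com/repositories?since=364"
```

You would call this after each successful response and stop requesting once it returns null.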

bumblebeen

You could use the pagination data provided in response.headers.link that's received when making calls to the GitHub API to find out if there are any more pages left for your call.
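
To make the header's shape concrete, here is a minimal sketch of pulling the individual URLs out of it. The function name parseLinkHeader is hypothetical, and the sketch assumes the usual '<url>; rel="name"' entries separated by commas, with no commas inside the URLs themselves.

```javascript
// Hypothetical helper: turn a Link header string into an object keyed by rel.
// Assumes entries look like '<url>; rel="name"' and are separated by commas.
function parseLinkHeader(header) {
  var links = {};
  if (!header) {
    return links; // no header means no pagination info
  }
  header.split(',').forEach(function (part) {
    var match = part.match(/<([^>]+)>\s*;\s*rel="([^"]+)"/);
    if (match) {
      links[match[2]] = match[1]; // rel name -> URL
    }
  });
  return links;
}

// Made-up header value, only for illustration.
var sample = '<https://api.github.com/repositories?since=364>; rel="next", ' +
             '<https://api.github.com/repositories>; rel="first"';
var links = parseLinkHeader(sample);
console.log(links.next);
// prints "https://api.github.com/repositories?since=364"
```

If links.next is undefined, there is no further page to request.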

One approach is to loop through the pages until there are no more pages left, at which point you can write to the file and return from the function.

On each iteration you can add to the data that you already have by using concat (I assume that the response body is delivered as an array) and then pass the combined data on to the next function call.

I rewrote your code to include a basic implementation of such a technique:

var request = require('request');
var fs = require('fs');

// Build the request options for a given page URL.
var requestOptions = function(url) {
  return {
    url: url,
    headers: {
      'User-Agent': 'myusernamehere'
    },
    qs: {
      stargazers: 5000
    }
  };
};

function doRequest(url, incomingRepos) {
  request(requestOptions(url), function(error, response, body) {
    if (!error && response.statusCode == 200) {
      console.log(response.headers);

      var currentPageRepos = JSON.parse(body);
      var joinedRepos = incomingRepos.concat(currentPageRepos);

      // The Link header may be missing, so fall back to an empty string.
      // This endpoint pages by "since" rather than by page number, so we
      // follow the rel="next" URL it gives us instead of incrementing a counter.
      var linkData = response.headers.link || '';
      var next = linkData.match(/<([^>]+)>;\s*rel="next"/);

      // if response does not include reference to next page
      // then we have reached the last page and can save content and return
      if (!next) {
        fs.writeFile('output_teste.json', JSON.stringify(joinedRepos), function(err) {
          if (err) throw err;
          console.log('It\'s saved!');
        });
        return;
      }

      doRequest(next[1], joinedRepos);
    } else {
      // response is undefined when the request itself fails
      console.log(error || response.statusCode);
    }
  });
}

doRequest('https://api.github.com/repositories', []);

ilokhov