
I'm looking for a way to randomly sample repos from GitHub. The end goal is to perform some data analysis on the sample.

What I would like to do is sample by the repository's id: draw an int between 0 and 2.7 million and find the associated repo. Once I have the username/repo-name, I'll use the API to get details.

The problem is I do not know how to search by repo id. Any suggestions? I'm open to webscraping or Python solutions.

Cam.Davidson.Pilon
  • Not sure if it helps, but you can access a user by int id via the REST API. Then you can access any repository by that random user. – three Feb 24 '13 at 18:47

1 Answer


You can use Python to access the GitHub v3 API (as in "Most suitable python library for Github API v3").

You can also list GitHub repos starting from a certain id: GET /repositories takes, as a parameter, the integer ID of the last Repository that you’ve seen. That provides a roundabout way to access repos by their id.
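As a minimal sketch of that idea, assuming the parameter is named `since` (as in https://api.github.com/repositories?since=50000) and that the response is a JSON list whose entries carry `id` and `full_name` fields: draw a random id, request the page that starts after it, and keep the first repo. The helper names `fetch_page` and `sample_repo` are hypothetical, and the 2.7 million upper bound is simply the figure from the question. Because some ids have no public repo behind them, a repo that follows a gap is more likely to be picked, so the sample is only approximately uniform.

```python
import json
import random
import urllib.request

API_URL = "https://api.github.com/repositories"


def fetch_page(since):
    """Fetch one page of public repos listed after the id `since`."""
    request = urllib.request.Request(
        f"{API_URL}?since={since}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-sampler-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def sample_repo(max_id, fetch=fetch_page, rng=random):
    """Draw a random id below max_id and return (id, full_name) of the
    first repo listed after it, or None if the page is empty."""
    since = rng.randint(0, max_id - 1)
    page = fetch(since)
    if not page:
        return None
    first = page[0]
    return first["id"], first["full_name"]


if __name__ == "__main__":
    # Upper bound taken from the question; unauthenticated calls are rate limited.
    print(sample_repo(2_700_000))
```

The `fetch` and `rng` arguments are there only so the sampling logic can be exercised without touching the network; in normal use `sample_repo(2_700_000)` is enough, and the returned `full_name` (username/repo-name) can then be passed to the rest of the API for details.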

VonC
  • I do not understand your solution, do you mind expanding on it some more? In particular, what does "the last Repository that you’ve seen" mean? – Cam.Davidson.Pilon Feb 24 '13 at 19:47
  • @Cam.Davidson.Pilon It means it will list all repos starting from a certain id. In your case, you can choose only the first one as a way to access a repo by its id. – VonC Feb 24 '13 at 20:22
  • So for example, https://api.github.com/repositories?ID=50000 should return IDs >= 50000. But (at least for me), this url does not do that. – Cam.Davidson.Pilon Feb 24 '13 at 20:25
  • @Cam.Davidson.Pilon not `?ID=50000`, but `?since=50000`: the name of the parameter is '`since`'. See https://api.github.com/repositories?since=50000 – VonC Feb 24 '13 at 20:30
  • Any way to find what the most recent ID is globally (to sample uniformly from all repositories)? Besides some kind of binary search? – ondra.cifka Jun 06 '20 at 09:35
  • @ondra.cifka 7+ years later, any search involving multiple repositories would be done with BigQuery (https://codelabs.developers.google.com/codelabs/bigquery-github/index.html?index=..%2F..index#0) or possibly with the GraphQL GitHub API v4 (https://developer.github.com/v4/) – VonC Jun 06 '20 at 09:42