
I am trying to fetch a list of all public repositories from GitHub to do some analysis on them. I started with their v3 API, which is RESTful, and then, when I needed more information such as star counts, migrated from v3 to v4, which is provided as GraphQL. I now request 100 records each time and do this recursively to fetch all records.

The problem is with pagination. For pagination to work, I have to take the endCursor of each response and pass it as the after argument of the next request. The problem is that the data is not paginated properly. For example:

  1. Requesting the first page (without any cursor) multiple times returns different records each time.
  2. Requesting a page with the same cursor multiple times also returns different results.
  3. If I ignore this and simply fetch one page of 100 records after another, each page contains many duplicates of previous requests, which means the pagination does not work correctly.

The query that I am sending (from a Node.js app) is as follows:

{
  search(query: "is:public", type: REPOSITORY, first: 100, after: "Y3Vyc29yOjEwMA==") {
    repositoryCount
    userCount
    wikiCount
    pageInfo {
      startCursor
      endCursor
      hasNextPage
      hasPreviousPage
    }
    edges {
      node {
        ... on Repository {
          databaseId
          id
          name
          description
          forkCount
          isFork
          issues {
            totalCount
          }
          labels (first: 100) {
            nodes {
              name
            }
          }
          languages (first: 100) {
            nodes {
              name
            }
          }
          licenseInfo {
            name
          }
          nameWithOwner
          primaryLanguage {
            name
          }
          pullRequests {
            totalCount
          }
          watchers {
            totalCount
          }
          stargazers {
            totalCount
          }
        }
      }
    }
  }
}

As I said above, for the first request I omit the after argument from the search inputs, and then I use the endCursor of the previous response as the after argument of the next request.
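
For context, the loop in my Node.js app looks roughly like the sketch below. This is simplified: it assumes Node 18+ with the built-in fetch, a personal access token in a GITHUB_TOKEN environment variable, and a trimmed-down field selection.

// Simplified pagination loop. Assumes Node 18+ (built-in fetch) and a
// personal access token in the GITHUB_TOKEN environment variable; the
// field selection is trimmed down here for brevity.
const ENDPOINT = 'https://api.github.com/graphql';

function buildQuery(afterCursor) {
  // On the first request afterCursor is null, so the `after` argument
  // is omitted entirely.
  const afterArg = afterCursor ? `, after: "${afterCursor}"` : '';
  return `{
    search(query: "is:public", type: REPOSITORY, first: 100${afterArg}) {
      pageInfo { endCursor hasNextPage }
      edges { node { ... on Repository { nameWithOwner } } }
    }
  }`;
}

async function fetchAll() {
  const repositories = [];
  let cursor = null;
  let hasNextPage = true;

  while (hasNextPage) {
    const response = await fetch(ENDPOINT, {
      method: 'POST',
      headers: {
        Authorization: `bearer ${process.env.GITHUB_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ query: buildQuery(cursor) }),
    });
    const { data } = await response.json();

    repositories.push(...data.search.edges.map((edge) => edge.node));

    // Take endCursor of this response and feed it into `after` of the next one.
    cursor = data.search.pageInfo.endCursor;
    hasNextPage = data.search.pageInfo.hasNextPage;
  }

  return repositories;
}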

Am I misunderstanding the purpose and usage of the cursor, or is this a bug (intended or unintended) on GitHub's side?

    I suspect the issue is that `search` doesn't expose a way to sort the results. If it's like the REST API, the default sort [may be the search ranking](https://developer.github.com/v3/search/#ranking-search-results), which is meaningless when your query is `"is:public"`. Part of the issue is that you're also querying a very rapidly changing dataset. If you include a specific keyword in the search, you'll get more consistency in the results. – Daniel Rearden Oct 07 '19 at 12:03
    @DanielRearden I think the GitHub documentation is very confusing and weak on this topic, but if you are right, is there a way to put a search term before `is:public` that forces the results to be ordered in a consistent way, such as by their id? I have tried `databaseId:>0 is:public`, but the result comes back empty. – ConductedClever Oct 09 '19 at 05:44

1 Answer


Fortunately, I have found a way that works for now, and many thanks to @Daniel Rearden for his very helpful tip. I tested many query strings and found that if I search for a specific creation date, the data is sorted by that field; in my tests the order stays consistent and the cursor becomes meaningful.

The query is now this:

{
  search(query: "created:2008-02-08 is:public", type: REPOSITORY, first: 100) {
    repositoryCount
    userCount
    wikiCount
    pageInfo {
      startCursor
      endCursor
      hasNextPage
      hasPreviousPage
    }
    edges {
      node {
        ... on Repository {
          databaseId
          id
          name
          description
          forkCount
          isFork
          issues {
            totalCount
          }
          labels (first: 100) {
            nodes {
              name
            }
          }
          languages (first: 100) {
            nodes {
              name
            }
          }
          licenseInfo {
            name
          }
          nameWithOwner
          primaryLanguage {
            name
          }
          pullRequests {
            totalCount
          }
          watchers {
            totalCount
          }
          stargazers {
            totalCount
          }
          createdAt
          updatedAt
          diskUsage
        }
      }
    }
  }
}

Now the only thing left is to scroll over the days and run this query repeatedly for each day, as long as pageInfo.hasNextPage is true.
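
In code, that day-by-day loop looks something like the sketch below. Here runQuery is a placeholder for whatever helper POSTs a GraphQL query to https://api.github.com/graphql and returns the parsed data object (like the request shown in the question), and the date range and field selection are only illustrative.

// Sketch of scrolling over creation dates. `runQuery` is a placeholder for
// whatever helper POSTs a GraphQL query to https://api.github.com/graphql
// and returns the parsed `data` object; the field selection is trimmed.
async function fetchDay(dateString) {
  const repos = [];
  let cursor = null;
  let hasNextPage = true;

  // Paginate within a single day while hasNextPage stays true.
  while (hasNextPage) {
    const afterArg = cursor ? `, after: "${cursor}"` : '';
    const data = await runQuery(`{
      search(query: "created:${dateString} is:public", type: REPOSITORY, first: 100${afterArg}) {
        pageInfo { endCursor hasNextPage }
        edges { node { ... on Repository { nameWithOwner createdAt } } }
      }
    }`);

    repos.push(...data.search.edges.map((edge) => edge.node));
    cursor = data.search.pageInfo.endCursor;
    hasNextPage = data.search.pageInfo.hasNextPage;
  }

  return repos;
}

async function fetchRange(startDate, endDate) {
  const all = [];
  const end = new Date(endDate);

  // Walk one UTC day at a time from startDate to endDate (inclusive).
  for (let day = new Date(startDate); day <= end; day.setUTCDate(day.getUTCDate() + 1)) {
    const dateString = day.toISOString().slice(0, 10); // "YYYY-MM-DD"
    all.push(...await fetchDay(dateString));
  }

  return all;
}

// e.g. fetchRange('2008-02-08', '2008-02-10').then((repos) => console.log(repos.length));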

I have not yet tested this for all ~4000 days, and I may not be able to verify that the fetched results cover all the data that exists in their database, but it seems to be the best solution.
