
The GitHub Archive project states:

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

This archive is also queryable through Google BigQuery. However, it looks like either I'm missing something or only a portion of the data is available.

Indeed, running the following query returns only 1636 WatchEvents (started or stopped), whereas the Rails repository has more than 14,300 watchers.

SELECT actor_attributes_login, created_at, payload_action
FROM [githubarchive:github.timeline]
WHERE repository_name = "rails"
  AND type = "WatchEvent"
ORDER BY created_at ASC;

It looks like the oldest piece of data retrieved is roughly 2.5 months old.
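
To double-check where the data begins, a query along these lines should return the earliest and latest timestamps among those events. It is only a sketch: it reuses the table and columns from the query above, the aliases first_event, last_event, and event_count are names picked for illustration, and it assumes MIN and MAX order created_at chronologically.

```sql
-- Sketch only: same table and filters as the query above.
-- The aliases are illustrative names, not part of the dataset.
SELECT
  MIN(created_at) AS first_event,  -- oldest matching WatchEvent
  MAX(created_at) AS last_event,   -- most recent matching WatchEvent
  COUNT(*) AS event_count          -- number of matching rows
FROM [githubarchive:github.timeline]
WHERE repository_name = "rails"
  AND type = "WatchEvent";
```

If first_event falls only a couple of months back, the gap comes from where the archive starts rather than from the query's filters.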

Has the data been truncated (which would seem strange for an archive)? Or is there a BigQuery limit or quota that I'm not aware of?

github-archive

nulltoken

1 Answer


That's correct. The project/crawler went live on March 11th of this year (2012), so the current archive starts on that day. There is a note about this on the githubarchive.org page, but I guess I should make it more visible and explicit.

There is a thread with the GitHub team about making more of their history available, but I don't have an ETA for it yet. Fingers crossed :-)

igrigorik
  • Thx for this answer and for the *awesome* GitHub Archive initiative! – nulltoken May 24 '12 at 19:23
  • You're right. There's a note I overlooked which states *"timeline data is available starting March 11, 2012."* I also think this statement *might* deserve some more exposure ;) – nulltoken May 24 '12 at 19:29