Questions tagged [github-archive]

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

GitHub provides 18 event types, ranging from new commits and forks to opening new tickets, commenting, and adding members to a project. The activity is aggregated into hourly archives, which you can access with any HTTP client.

Each archive contains a stream of JSON-encoded GitHub events, which you can process in any language.
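
As a rough illustration of reading one hourly archive, here is a minimal Python sketch. The helper names (archive_url, fetch_archive, iter_events, count_event_types) are hypothetical and not part of any GitHub Archive tooling. The URL pattern is an assumption inferred from the example URLs quoted in the questions below: date parts zero-padded, hour appended as-is. The parser does not rely on one event per line, since some archives reportedly place several events on the same line.

```python
import gzip
import json
import urllib.request
from collections import Counter

_decoder = json.JSONDecoder()


def archive_url(year, month, day, hour):
    """Build the URL of one hourly archive.

    Assumption: the pattern is YYYY-MM-DD-H.json.gz, with the hour (0-23)
    not zero-padded, as in 2015-01-01-15.json.gz.
    """
    return (
        "http://data.githubarchive.org/"
        f"{year:04d}-{month:02d}-{day:02d}-{hour}.json.gz"
    )


def fetch_archive(url):
    """Download one archive and return its raw gzip bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def iter_events(gz_bytes):
    """Yield each JSON event from a gzip-compressed archive.

    Events are decoded one after another, so it does not matter whether
    they are separated by newlines or written back to back on one line.
    The whole archive is decompressed into memory, which is fine for a
    sketch but may be heavy for large files.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        event, pos = _decoder.raw_decode(text, pos)
        yield event


def count_event_types(gz_bytes):
    """Count events by their "type" field (e.g. PushEvent)."""
    return Counter(event.get("type") for event in iter_events(gz_bytes))
```

A typical call would be count_event_types(fetch_archive(archive_url(2015, 1, 1, 15))). Looping the hour over range(24) covers a full day, one request per hourly file.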

The GitHub Archive dataset is also available via Google BigQuery.

24 questions
1
vote
1 answer

Parsing githubarchive responses

I'm working on an entry to the GitHub Data Challenge and trying to analyze a set of PushEvents, but I'm getting some strange(?) results. users = Hash.new(0) (0..23).each do |hour| gz =…
tanookiben
  • 22,575
  • 8
  • 27
  • 25
0
votes
0 answers

Getting all files with a specific ending in the GitHub Archive BigQuery dataset, but ignoring forks

I want to download all files with a specific ending from GitHub, using the GitHub Archive on BigQuery to achieve this. With some help I already have this code, which kind of works: SELECT f.repo_name, f.path, content.copies, content.size,…
Pux
  • 421
  • 3
  • 18
0
votes
1 answer

Download githubarchive data with PHP and httpclient

I'm trying to download a gz file locally from githubarchive with httpclient in PHP. When I execute wget in a terminal, the gz is extracted and each folder is downloaded to my computer. When I do the same in PHP code, I encounter a 404 each…
chaillouvincent
  • 197
  • 1
  • 3
  • 17
0
votes
0 answers

Retrieving languages and stargazers of GitHub repos

I'm new to SQL and GitHub Archive and am trying to get the list of languages and stargazers of the popular repositories on GitHub. The information I'm looking for is repo id, repo languages (languages + percentage), repo stargazers (and their…
MonMon
  • 1
  • 1
0
votes
1 answer

Missing data in GitHub Archive on BigQuery?

Using BigQuery's tables from the GitHub Archive, and running a query on pull requests for the typelevel/cats repo, there are no entries prior to 1/1/2016, despite the actual repo showing activity beginning…
anjarp
  • 67
  • 1
  • 6
0
votes
1 answer

GitHub Archive - Issues with retrieving data with ranges

I am trying to retrieve data from GitHub Archive (https://www.githubarchive.org/) and am having trouble retrieving data when I add a range. It works when I use http://data.githubarchive.org/2015-01-01-15.json.gz, but I get a `open_http': 404 Not…
Andy Kwong
  • 197
  • 8
0
votes
1 answer

Why do consecutive event JSONs fall on the same line in some packages in githubarchive?

In http://www.githubarchive.org/, which Ilya Grigorik has provided, I found that in many gz files some consecutive events are logged to the same file, for example in 2011-03-15-21.json.gz. To get the above, do: wget…
Harish Kayarohanam
  • 3,886
  • 4
  • 31
  • 55
0
votes
2 answers

Convert 10,000+ JSON files into one single SQLite db?

OK, so I wanted to build a simple web app that would somehow use githubarchive data. At first I thought of using the BigQuery database and its API; however, my free quota would be over in just a day. So, what I've done is download all 2012/2013…
KGo
  • 18,536
  • 11
  • 31
  • 47
0
votes
1 answer

Getting data from GitHub Archive

I tried to get historical data from GitHub Archive by entering http://data.githubarchive.org/2012-04-15.json.gz, but I got no data. How do I get data about activity on GitHub?
Anthony
  • 3,990
  • 23
  • 68
  • 94