Questions tagged [github-archive]

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

GitHub provides 18 event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, which you can access with any HTTP client.

Each archive contains a stream of JSON-encoded GitHub events, which you can process in any language.
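As a rough illustration of the two paragraphs above, here is a minimal Python sketch that counts events by type in one hourly archive. It rests on assumptions not stated in this tag description: that each archive is a gzip-compressed file with one JSON event per line, that every event carries a `type` field, and that archives are served under a URL of the form built by the hypothetical helper `archive_url`. Check the project's site for the current host and file layout before relying on any of this.

```python
import gzip
import json
from collections import Counter


def archive_url(date: str, hour: int) -> str:
    """Build the URL of one hourly archive.

    The host and naming pattern are assumptions, not taken from this page.
    """
    return f"https://data.gharchive.org/{date}-{hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their "type" field in one gzip-compressed archive.

    Assumes newline-delimited JSON: one event object per line.
    """
    counts = Counter()
    text = gzip.decompress(gz_bytes).decode("utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts
```

To try it on real data, the bytes could be fetched with any HTTP client, for example `urllib.request.urlopen(archive_url("2015-01-01", 15)).read()`, and passed to `count_event_types`. Keeping the download separate from the parsing makes the parser easy to test on small in-memory samples.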

The GitHub Archive dataset is also available via Google BigQuery.

24 questions
4
votes
2 answers

How to measure language popularity via Github Archive data?

I'm attempting to measure programming language popularity via: The number of stars on repos in combination with... The programming languages used in the repo and... The total bytes of code in each language (recognizing that some languages are…
Abe
  • 156
  • 5
  • 17
3
votes
1 answer

How to obtain java repositories having maximum number of stars in GitHub-Archive

I am currently trying to obtain the top 100 java repositories having maximum number of stars and less than 100 commits using GitHub Archive and BigQuery. Could you please help to come up with a query for obtaining the top 100 repositories having…
user2475467
  • 129
  • 1
  • 10
3
votes
1 answer

Tracing the growth of top 100 repositories on GitHub?

I am trying to trace the growth of the top 100 repositories on GitHub. I have the following query: SELECT MAX(repository_forks) as forks, repository_url FROM [publicdata:samples.github_timeline] WHERE (created_at CONTAINS "2012-04-01") GROUP BY…
histelheim
  • 4,938
  • 6
  • 33
  • 63
2
votes
1 answer

If you archive a github pages repo will the content at github.io persist?

I have a project that I want to archive, which is https://github.com/scriptish/scriptish.github.com This repo uses GitHub Pages to generate static content at http://scriptish.github.io/ which I don't want to disappear. So my question if I archive the…
erikvold
  • 15,988
  • 11
  • 54
  • 98
2
votes
1 answer

Why does the number of forks in Github Archive on Big Query not match the UI?

I am trying to get various GitHub repo metrics from GitHub Archive through BigQuery (doc here). However, when I try to count the number of forks, the number I am getting is very different from the number of forks specified in the GitHub UI. For…
walker_4
  • 433
  • 1
  • 7
  • 21
2
votes
2 answers

How to search github projects ordered by number of commits?

I was thinking of trying out BigQuery and GithubArchive, but I'm not sure how to compose a query that would let me search for a term in code or project and order the results by number of commits descending. Thanks for any tips
slashdottir
  • 7,835
  • 7
  • 55
  • 71
2
votes
1 answer

How to obtain java repositories having maximum number of stars and less than 100 commits

I am currently trying to obtain the top 100 java repositories having maximum number of stars and less than 100 commits using GitHub Archive and BigQuery. Could you please help to come up with a query for this purpose. The initial query I have…
user2475467
  • 129
  • 1
  • 10
2
votes
1 answer

Google BigQuery SQL Statement

I am trying to get some data from the GitHub Archive using Google Big Query. The current amount of data I am requesting is too much for BigQuery to process (at least in the free tier) so I am trying to limit the scope of my request. I want to limit…
Ankush Agrawal
  • 395
  • 2
  • 14
2
votes
2 answers

On GitHub, is there a way to find the connection between issues and pull requests, between issues and commits, etc.?

On the GitHub website, a lot of issues are connected to (referenced by) pull requests or commits. Is there a way to find these connections in the GitHub Archive database or through the GitHub API?
2
votes
1 answer

How far can one retrieve data from GitHub Archive?

The GitHub Archive project states GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. This archive is also queryable through Google Big Query. However, it looks like…
nulltoken
  • 64,429
  • 20
  • 138
  • 130
1
vote
1 answer

Encryption of email in GitHub Archive

I queried GitHub Archive and got an encrypted email address. The address I got from the query: 9b2aaf20c3f2c0c9b21ada60e9bca6ef34b3dbc7@outlook.com The address it is supposed to be: phil12328@outlook.com Does anyone know how to decrypt it?
Tal Folkman
  • 2,368
  • 1
  • 7
  • 21
1
vote
0 answers

Monitoring changes in collaborators on github

I have a pet project where I am trying to get some stats about collaborators (core team members, as per https://github.com/CoolProp/CoolProp/wiki/Contributors-vs-Collaborators). Basically I want to know when people were added to a repo. Since I want…
goodie_oh
  • 43
  • 4
1
vote
2 answers

How to pull github timeline data from BigQuery

I am having trouble accessing the GitHub timeline from BigQuery. I was using the following query: SELECT repository_name, actor_attributes_company, payload_ref_type, payload_action, type, created_at FROM githubarchive:github.timeline WHERE…
Alex Jauch
  • 58
  • 5
1
vote
1 answer

Getting the latest repository info from GitHub Archive

I want to retrieve the latest info about a repository using Google BigQuery on the GitHub Archive timeline dataset. I tried to join on max(created_at) but I get vastly incomplete information. Here is the query for the rails repo: SELECT * FROM…
vdaubry
  • 11,369
  • 7
  • 54
  • 76
1
vote
1 answer

Google BigQuery: How do I get a distinct row for a value in query results

I am trying to use Google BigQuery on the github archive (http://www.githubarchive.org/) data to get the statistics for repositories at the time of their latest event and I am trying to get this for the repositories with the most watchers. I realize…
brycek
  • 23
  • 7