2

I am trying to get various Github repo metrics in Github Archive through Big Query(doc here). However, when I try to count the number of forks, the number I am getting is very different from the number of forks specified in the Github UI. For instance when I run this sql script:

SELECT repo.url,repo.name , COUNT(*) fork_count, 
FROM [githubarchive:year.2011],
  [githubarchive:year.2012],
  [githubarchive:year.2013],
  [githubarchive:year.2014],
  [githubarchive:year.2015],
  [githubarchive:year.2016],
  [githubarchive:year.2017],
  [githubarchive:year.2018],
  [githubarchive:month.201901]
WHERE type='ForkEvent'
and repo.url like 'https://github.com/python/cpython'
GROUP BY 1,2

I get a result of:

Row repo_url                           repo_name   fork_count    
1   https://github.com/python/cpython   cpython    177

However when I go to the URL 'https://github.com/python/cpython' I see that there are 8,198 forks. What is the reason for this discrepancy?

EDIT:

Felipe pointed out below that there could be multiple URL's for the same repo. Felipe's output

However, even with multiple URLS, the number was not a exact match to the UI and this time was substantially larger then the UI's number. Is there any way to get an exact match?

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
walker_4
  • 433
  • 1
  • 7
  • 21
  • @joan-grau - I am just curious - what is your logic for adding google-cloud-platform tag to questions with google-bigquery tag? I see you started doing this recently and this makes no much sense to me. – Mikhail Berlyant Jan 10 '19 at 08:56
  • @Mikhail-Berlyant It's to let the GCP support team to acknowledge the questions (and answers if there are any), making them more likely to get them resolved :) – Joan Grau Noël Jan 10 '19 at 09:07
  • @joan-grau - this makes no sense so I will be removing this tag (when it is purely sql / bigquery related) – Mikhail Berlyant Jan 10 '19 at 14:28

1 Answers1

1

What are you querying for? Notice you'll get different results depending if you go for the repo id, name, or url:

#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
  repo.id = 81598961 
  OR repo.name='python/cpython'
  OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3

enter image description here

If you want to know "when?":

#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
  , MIN(DATE(created_at)) since, MAX(DATE(created_at)) until
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
  repo.id = 81598961 
  OR repo.name='python/cpython'
  OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3
ORDER BY since

enter image description here

EDIT:

GitHub only lists one fork per user - so if you want to remove duplicates do COUNT(DISTINCT actor.id) which brings it down to ~9k.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Thanks Felipe, I am trying to do match the number of forks shown on the UI. If that takes multiple URL's that is fine by me, however it seems that with your query, the number of forks is substantially higher then the number on the UI. Would you know why? – walker_4 Jan 10 '19 at 01:53
  • 1
    GitHub only lists one fork per user - so if you want to remove duplicates do `COUNT(DISTINCT actor.id)`. It goes down to 9k, which is closer to the reported number. Maybe GitHub also decreases the count when an user deletes their account or their forks, which doesn't change the number of times that something has happened - which is what we are counting here. – Felipe Hoffa Jan 10 '19 at 04:32
  • Thanks, think this pretty much answers my question! If you found this info in a doc, would you be able to link it? – walker_4 Jan 10 '19 at 17:11
  • Stack Overflow is the docs :). Please remember to upvote/accept answers if they are an actual answer to your question! – Felipe Hoffa Jan 10 '19 at 22:08