I've been trying to build a tool that needs to fetch the URLs of all files in a GitHub code search result. For example, when you go here and search for uber.com api_key, you'll see that there are 381 code results, and I want to get all 381 file URLs.
To do that, I learned how to use the GitHub API v3 and wrote the following function:
    import json
    import requests
    from time import sleep, time

    def fetchItems(search, GITHUB_API):
        items = set()
        response = {"items": [1]}  # dummy value so the loop body runs at least once
        pageNumber = 1
        while response["items"]:
            sleep(3)  # trying to avoid the rate limit, not successful though :(
            url = "https://api.github.com/search/code"
            params = {
                "q": search,
                "per_page": 30,  # default value, it can be increased to 100
                "page": pageNumber
            }
            headers = {
                "Accept": "application/vnd.github+json",
                "Authorization": f"Bearer {GITHUB_API}"
            }
            r = requests.get(url=url, headers=headers, params=params, verify=False)
            if r.status_code == 403:  # if we exceed the rate limit, sleep until the rate limit gets reset
                epochReset = int(r.headers["X-Ratelimit-Reset"])
                epochNow = time()
                if epochNow < epochReset:
                    sleep((epochReset - epochNow) + 1)
                sleep(1)
                continue
            response = json.loads(r.text)
            for file in response["items"]:
                items.add(file["html_url"])
            pageNumber += 1
        return items
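For reference, this is roughly how I call it (a hypothetical usage sketch; GITHUB_API here is a placeholder, not a real token):

    GITHUB_API = "ghp_..."                          # placeholder, not a real token
    urls = fetchItems("uber.com api_key", GITHUB_API)
    print(len(urls))                                # how many unique file URLs were collected
    for u in sorted(urls):
        print(u)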
The per_page parameter sets the number of items returned in each page, and page is the page number :). By increasing the page number on every request, you should be able to get all the items, as far as I understand.
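In other words, with 381 results and 30 items per page, the loop should walk through 13 pages. A quick sketch of the arithmetic (assuming the total reported by the API):

    import math

    total_count = 381   # the total the search result reports for this query
    per_page = 30
    pages = math.ceil(total_count / per_page)
    print(pages)        # 13 -> 12 full pages of 30 plus 21 items on the last page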
However, when I opened my database and checked the items that had been written, I saw that there were only 377 files, so 4 of the files are missing.
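One way I thought of to rule out my own code is to compare the API's reported total_count against what fetchItems actually collects (a minimal diagnostic sketch reusing the imports above, not part of the tool itself):

    # Ask for a single result just to read the reported total_count.
    r = requests.get(
        "https://api.github.com/search/code",
        headers={"Accept": "application/vnd.github+json",
                 "Authorization": f"Bearer {GITHUB_API}"},
        params={"q": "uber.com api_key", "per_page": 1},
    )
    print("total_count:", r.json()["total_count"])               # what the API claims exists
    print("collected  :", len(fetchItems("uber.com api_key", GITHUB_API)))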
Because of my reputation I can't post images, so click here.
I checked the db writer function and I'm sure there's nothing wrong with it. Does the GitHub API return missing items in the JSON, or am I doing something wrong?