
I'm building a tool that needs to fetch the URLs of all files in a GitHub code search result. For example, when you go here and search for uber.com api_key, you'll see that there are 381 code results, and I want to get the URLs of all 381 files.

To do that, I learned how to use the GitHub API v3 and wrote the following function:

import json
from time import sleep, time

import requests

def fetchItems(search, GITHUB_API):
    
    items = set()
    response = {"items":[1]}
    pageNumber = 1
    
    while(response["items"]):
        
        sleep(3) # trying to avoid rate limit, not successful though :(

        url = "https://api.github.com/search/code"
        params = {
            "q" : search,
            "per_page" : 30, # default value, it can be increased to 100
            "page" : pageNumber
        }  
        headers = {
            "Accept" : "application/vnd.github+json",
            "Authorization" : f"Bearer {GITHUB_API}"
        }

        r = requests.get(url=url, headers=headers, params=params, verify=False)
        
        if r.status_code == 403: # if we exceed the rate limit, sleep until the rate limit resets
            epochReset = int(r.headers["X-Ratelimit-Reset"])
            epochNow = time()

            if epochNow < epochReset:
                sleep((epochReset - epochNow) + 1)
            
            sleep(1)
            continue
        
        response = json.loads(r.text)
    
        for file in response["items"]:
            items.add(file["html_url"])
        
        pageNumber += 1
    
    return items

The per_page parameter indicates the number of items returned on each page, and page is the page number :). By incrementing the page number on every request, I should be able to get all items, according to my understanding.
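
To make the pagination arithmetic concrete, here is a minimal sketch of how many requests the loop should need for a given result count. The helper names `pages_needed` and `page_params` are hypothetical and not part of the function above; the sketch only builds the query parameters and does not call the API.

```python
import math


def pages_needed(total_count: int, per_page: int = 30) -> int:
    """Return how many pages are required to cover total_count items."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_count / per_page)


def page_params(search: str, total_count: int, per_page: int = 30) -> list:
    """Build the query parameters for every page (pages are 1-based)."""
    return [
        {"q": search, "per_page": per_page, "page": page}
        for page in range(1, pages_needed(total_count, per_page) + 1)
    ]


if __name__ == "__main__":
    # 381 results at the default page size, then at the maximum of 100
    print(pages_needed(381, 30))
    print(pages_needed(381, 100))
    print(page_params("uber.com api_key", 381, 100)[-1])
```

With 381 results, this works out to 13 requests at 30 items per page, or 4 requests at 100 items per page, so a larger per_page should also reduce the pressure on the rate limit.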

However, when I opened my database and checked the items that had been written, I saw only 377 files, so 4 of the files are missing.

Because of my reputation I can't post images, so click here.

I checked the DB writer function and I'm sure there is nothing wrong with it. Does the GitHub API omit items from the JSON response, or am I doing something wrong?
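
One way to narrow this down is to compare what the API reports with what actually gets collected. The sketch below is only a diagnostic under stated assumptions: it assumes each parsed response carries the `total_count` and `incomplete_results` fields alongside `items`, and the helper name `summarize_pages` is hypothetical. Since `fetchItems` stores URLs in a set, any file that shows up on two pages would be counted once, which this summary would surface as duplicates.

```python
def summarize_pages(pages):
    """Summarize a list of parsed search responses (one dict per page)."""
    seen = []               # every html_url, in the order it was received
    incomplete = False      # True if any page flagged incomplete_results
    reported_total = None   # last total_count reported by the API

    for page in pages:
        reported_total = page.get("total_count", reported_total)
        incomplete = incomplete or bool(page.get("incomplete_results", False))
        seen.extend(item["html_url"] for item in page.get("items", []))

    unique = set(seen)
    return {
        "reported_total": reported_total,
        "items_seen": len(seen),
        "unique_urls": len(unique),
        "duplicates": len(seen) - len(unique),
        "incomplete_results": incomplete,
    }


if __name__ == "__main__":
    # Fabricated sample pages for illustration only, not real API output
    sample = [
        {"total_count": 3, "incomplete_results": False,
         "items": [{"html_url": "a"}, {"html_url": "b"}]},
        {"total_count": 3, "incomplete_results": True,
         "items": [{"html_url": "b"}]},
    ]
    print(summarize_pages(sample))
```

If `items_seen` matches the reported total but `unique_urls` does not, the gap comes from repeated entries rather than from the database writer; if `items_seen` itself falls short, the responses are where the items go missing.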
