
I am trying to get information about repositories using the GitHub API, and I am using R for this. Some URLs throw 403 errors. Unfortunately, this stops my function and breaks the fromJSON call; calling fromJSON again always results in "client error: (403) Forbidden".

Is there a way to handle exceptions in R so my function can continue executing if I get a 403?

My function is as follows:

library(jsonlite)

getData <- function(start, end) {
  languages <- NULL
  names <- NULL
  base_url <- 'https://api.github.com/users/'
  for (num in start:end) {
    url <- paste(base_url, num, '/repos', sep = '')
    cat(url, "\n")   # log the URL being fetched
    df <- fromJSON(url)
    languages <- c(languages, df$language)
    names <- c(names, df$name)
  }
  data.frame(languages, names)
}
user2977636
  • do you need to scrape it this way? R has some github api packages including this one https://github.com/cscheid/rgithub – hrbrmstr Jun 27 '15 at 11:24
  • Your `403 Forbidden` is most likely the GitHub API telling you that you exceeded your non-authenticated API limit, btw. – hrbrmstr Jun 27 '15 at 11:40

3 Answers


As I suggested in my comment, you may be better off using the GH API from one of the R packages that implements it. However, if you are determined to build it from scratch, the following code:

  • uses the built-in JSON-to-R decoding that httr gives you for free
  • checks for valid response codes
  • accounts for potentially missing fields in the return value
  • uses data.table for both efficiency and easier handling of data frame building

It also gives you progress bars for free with pbapply.

library(httr)
library(data.table)
library(pbapply)

get_data <- function(start, end) {
  base_url <- 'https://api.github.com/users/%d/repos'
  pblapply(start:end, function(i) {
    resp <- GET(sprintf(base_url, i))
    warn_for_status(resp)
    if (status_code(resp) == 200) {
      dat <- content(resp, as="parsed")
      data.table(name=sapply(dat, function(x) ifelse(is.null(x[["name"]]), NA, x[["name"]])),
                 language=sapply(dat, function(x) ifelse(is.null(x[["language"]]), NA, x[["language"]])))
    } else {
      data.table(language=NA, name=NA)
    }
  })
}

gh <- rbindlist(get_data(1, 6))

gh
##                       name     language
##  1: python-youtube-library       Python
##  2:                      t           NA
##  3:               dotfiles         VimL
##  4:               pair-box           NA
##  5:           6.github.com   JavaScript
##  6:             AndAnd.Net           C#
##  7:         backbone-tunes   JavaScript
##  8:            battletower CoffeeScript
##  9:              BeastMode         Ruby
## 10:   blurry_search.coffee   JavaScript
## 11:              bootstrap          CSS
## 12:     browser-deprecator   JavaScript
## 13:            classify.js   JavaScript
## 14:          cocoa-example  Objective-C
## 15:               Colander CoffeeScript
## 16:        comic_reader.js   JavaScript
## 17:            crawl-tools       Python
## 18:            CS-Projects       Python
## 19:                cssfast CoffeeScript
## 20:               danbooru         Ruby
## 21:                    Dex CoffeeScript
## 22:             dnode-ruby         Ruby
## 23:             domain-gen         Ruby
## 24:            domainatrix         Ruby
## 25:                Doodler         Java
## 26:               dotfiles         VimL
## 27:                 dothis         Ruby
## 28:             elixir-web       Elixir
## 29:           faster_manga CoffeeScript
## 30:                 favmix         Java
## 31:                 fluent         Ruby
## 32:       fluid-image-grid   JavaScript
## 33:               freeform         Ruby
## 34:           FreeYourCode         Ruby
##                       name     language

Go easy on the free API access. This code will warn you if it gets a 403 but keep processing (you can change that by swapping stop_for_status for warn_for_status, or by testing the status and stopping on your own). Note that failed requests still end up as rows of NAs in the result.

IMO it would be far more advantageous to use the authenticated API access.
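As a sketch of what that looks like with httr (assuming a personal access token stored in a GITHUB_PAT environment variable — the variable name is just a common convention, not something the API requires), an authenticated request passes the token in the Authorization header, which raises the unauthenticated limit of 60 requests per hour to 5,000:

```r
library(httr)

# Assumption: a GitHub personal access token is stored in the
# GITHUB_PAT environment variable (a convention, not an API requirement).
token <- Sys.getenv("GITHUB_PAT")

# The "token <PAT>" form in the Authorization header authenticates
# the request against the GitHub API.
resp <- GET("https://api.github.com/users/1/repos",
            add_headers(Authorization = paste("token", token)))
```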

hrbrmstr

See ?try and ?tryCatch for a guide to exception handling.

Here's a proof of concept showing how execution can continue after fromJSON raises a 404 error, going on to print "ok":

> try({fromJSON("http://www.google.com/nosuch")}) ; cat("ok\n")
Error in download_raw(txt) : client error: (404) Not Found
ok

You can test the return value from try to see if the code raised an error. See the help pages for more.
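For example, tryCatch lets you substitute a fallback value when a call fails. A minimal sketch (the error here is simulated with stop() rather than a real HTTP request, so it runs without network access):

```r
fetch_or_na <- function(fetch) {
  # Run the supplied fetch function; on any error, print a message
  # and return NA instead of aborting the enclosing loop.
  tryCatch(fetch(),
           error = function(e) {
             message("request failed: ", conditionMessage(e))
             NA
           })
}

# Simulated failure standing in for fromJSON() hitting a 403:
res <- fetch_or_na(function() stop("client error: (403) Forbidden"))
is.na(res)   # TRUE
```

Inside your loop you would wrap the fromJSON call the same way, then skip the iteration when the result is NA.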

Spacedman

Here's a better way of doing it using httr to process the requests. It also uses plyr for the main loop. Invalid entries have NAs in them. You could remove those later if desired.

library("httr"); library("plyr"); library("jsonlite")
getData <- function(start, end) {
  base_url <- "https://api.github.com/users/"
  ldply(start:end, function(num) {
    url <- paste0(base_url, num, "/repos")
    cat(url, "\n")
    resp <- GET(url)
    # Fallback row for failed or empty responses, so the loop
    # always has something to return.
    out <- data.frame(language = NA, name = NA)
    if (status_code(resp) == 200) {
      df <- fromJSON(content(resp, "text"))
      if (length(df) > 0) {
        out <- df[, c("language", "name")]
      }
    }
    out
  })
}
Nick Kennedy