Handling inconsistent test results when scraping a fluctuating website

Question

Would appreciate advice from the community on how best to handle an aggravating situation.

I have a R package that scrapes the National Hurricane Center archives and returns tidy storm data. The website often does not respond.

An example of this would be this Appveyor failure and then the subsequent pass (same branch, same commit).

Right now it has failed on four consecutive tests:

Builds 1.0.141 and 1.0.143 pass the first 119 tests. Build 1.0.142 passes the first 142 tests. Build 1.0.144 fails only after 66 tests. The errors are consistent:

Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached
Calls: test_check ... request_fetch -> request_fetch.write_memory -> -> .Call

I have so many tests because there are minor discrepancies or typos in several of the products that are scraped. So, when I modify a regex pattern to accomodate these discrepancies I want to ensure I don't inadvertently break something else.

I have added options for multiple attempts, delays between requests and timeout options. Unfortunately they do not seem to have helped much, if at all.

My question to the community: can you offer advice or suggestions as a better way to handle this situation? I know it's bad form to keep tests isolated from your production environment. But I don't need all of these tests for the entire package.

How would you handle these issues?

score 1 · Answer 1 · answered Jul 09 '17 at 01:09

If you're OK with sporadically skipping the tests when you get a timeout then I'd capture the timeout like they did here and just skip the test if the response indicates there was a timeout. Something like:

safe_GET <- purrr::safely(httr::GET)

skip_if_cant_get <- function(url) {
  resp <- safe_GET(url)

  if (is.null(resp$result)) {
    testthat::skip(paste0("Couldn't get ", url, " for testing"))
  }

  resp
}

resp <- skip_if_cant_get("http://www.nhc.noaa.gov/archive/1998/1998archive.shtml")
httr::http_status(resp$result)
#> $category
#> [1] "Success"
#> 
#> $reason
#> [1] "OK"
#> 
#> $message
#> [1] "Success: (200) OK"

# The timeout on this one will trigger the skip()
# resp <- skip_if_cant_get("http://deadbeefdeadbeef.org/")

This approach will stop the builds from breaking.

Handling inconsistent test results when scraping a fluctuating website

1 Answers1