I'm trying to calculate the flight time between every pair of countries, in the hope of creating parameters for a fraud prevention tool I'm scoping out.

The website url I'm working with is https://www.travelmath.com/flying-time/from/Canada/to/Germany

The third column of my data frame holds that URL with the two country references replaced by every possible combination.

I'm trying to use rvest with a loop to do this but keep receiving various errors. I've looked on Stack Overflow and tried to adapt other solutions to my problem, but have run into numerous issues. Lastly, I want the loop to avoid blasting the website I'm querying with 55225 requests in a short window.

Here's the most recent solution I've tried (code below), but I keep getting the errors listed after it, over and over.

I've tried re-arranging my data frame and handling the replacement of the origin and destination.

I've tried using RSelenium to do this but ran into other issues as well.

I've tried re-formatting other solutions to similar problems but still receive errors.

tables <- list()
index <- 1
for (i in CountryPairs){
    try(
        {
            url <- paste0("https://www.travelmath.com/flying-time/from/",i)
            table <- url %>%
            read_html()%>%
            html_nodes("#flyingtime")

            tables[index] <- table

            index <- index +1
        }
    )
}
df<-do.call("rbind",tables)

Error in open.connection(x, "rb") : HTTP error 400.

Error in tables[index] <- table : replacement has length zero

Jabez
  • An HTTP Error 400 means the request was incorrect. Could you please provide a sample from CountryPairs so that we can test your code? Did you try to print your `url` variable to see if it's correct? – Biblot Oct 17 '19 at 09:04

1 Answer

I took a list of countries to build your CountryPairs variable and used your code to come up with this. The tables variable gets filled with the flight times as character vectors. Since you got HTTP 400 errors, I think the problem lies in how you generate the CountryPairs variable, which produces a malformed request URL.

library(dplyr)
library(rvest)

# Vector of countries
countries <- c(
  "Afghanistan",
  "Albania",
  "Algeria",
  "Andorra",
  "Angola",
  "Argentina",
  "Armenia",
  "Australia",
  "Austria",
  "Azerbaijan"
)

# Build all combinations of two countries
countries_combinations <- combn(countries, 2)

# Build the country pairs as "Country1/to/Country2" for the request to travelmath
country_pairs <- apply(countries_combinations, 2, function(x) paste(x, collapse = "/to/"))

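# Collect one flight-time result per country pair
tables <- list()
index <- 1
for (c_pair in country_pairs) {
  try(
    {
      url <- paste0("https://www.travelmath.com/flying-time/from/", c_pair)

      # Get the flight time from the #flyingtime h3 tag
      table <- url %>%
        read_html() %>%
        html_nodes("#flyingtime") %>%
        html_text()

      # [[ ]] stores an empty result as character(0) instead of raising
      # "replacement has length zero", which tables[index] <- table does
      tables[[index]] <- table

      index <- index + 1
    }
  )
}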
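
If some of the names in CountryPairs contain spaces or other reserved characters, that may be what produces the malformed URL. Here is a minimal sketch of building and inspecting the URLs before sending anything. `build_flight_url()` is a hypothetical helper, not part of rvest, and I'm assuming the site accepts percent-encoded names, which I have not checked:

```r
# Hypothetical helper: build one request URL from an origin and a destination.
# Assumption: the site accepts percent-encoded names; open one URL in a browser to confirm.
build_flight_url <- function(origin, destination) {
  paste0(
    "https://www.travelmath.com/flying-time/from/",
    utils::URLencode(origin, reserved = TRUE),
    "/to/",
    utils::URLencode(destination, reserved = TRUE)
  )
}

# Print a few URLs and inspect them before making any request
print(build_flight_url("Canada", "Germany"))
print(build_flight_url("United States", "Germany"))
```

If a printed URL still contains a raw space or looks different from what the site shows in the address bar, that pair is a likely source of the 400.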

EDIT: To remove unused connections, the only solution I found was on this Stack Overflow thread. You can call the function:

CatchupPause <- function(secs) {
  Sys.sleep(secs)        # pause to let the connection finish its work
  closeAllConnections()  # close any connections left open
  gc()                   # run garbage collection
}

at the end of your for loop, with secs = 3, to make sure the connections close properly.
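
To show where the pause fits, here is a rough sketch of the loop with the pause at the end of each iteration. `scrape_pairs()` and `fetch_flight_time()` are hypothetical names I made up for illustration. Failed requests are recorded as NA rather than stopping the loop:

```r
# Pause between requests and release connections
CatchupPause <- function(secs) {
  Sys.sleep(secs)
  closeAllConnections()
  gc()
}

# Hypothetical fetcher: read the page and extract the #flyingtime text (needs rvest installed)
fetch_flight_time <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_text(rvest::html_nodes(page, "#flyingtime"))
}

# Hypothetical wrapper: loop over the pairs, keep NA on failure, pause after each request
scrape_pairs <- function(pairs,
                         fetch = fetch_flight_time,
                         pause = function() CatchupPause(3)) {
  results <- setNames(rep(NA_character_, length(pairs)), pairs)
  for (p in pairs) {
    url <- paste0("https://www.travelmath.com/flying-time/from/", p)
    # An HTTP error or an empty node set leaves the NA in place
    value <- tryCatch(fetch(url), error = function(e) character(0))
    if (length(value) > 0) results[[p]] <- value[1]
    pause()
  }
  results
}

# Usage (sends real requests, so try it on a handful of pairs first):
# flight_times <- scrape_pairs(country_pairs)
```

Keep in mind that at 3 seconds per request, 55225 requests take roughly 46 hours, so you may want to run it in batches and save the results as you go.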

Biblot
  • Would close all connections at the end help with the following error? In .Internal(gc(verbose, reset, full)) : closing unused connection 3 – Jabez Oct 17 '19 at 10:21