Dear Stack Overflow users,
I am using R to scrape the profiles of a few psychotherapists from Psychology Today; this is an exercise to learn more about web scraping.
I am new to R and I have to go through this intense training that will help me with future projects. This means that I might not know precisely what I am doing at the moment (e.g. I might not interpret the script or R's error messages correctly), but I have to get it done. I therefore beg your pardon for possible misunderstandings or inaccuracies.
In short, the situation is the following. I have created a function that scrapes information from 2 nodes of psychotherapists' profiles; the function is shown in this Stack Overflow post.
Then I created a loop that applies the function to a few psychotherapists' profiles; the loop is in the above post as well, but I report it below because it is the part of the script that generates problems (in addition to those I solved in the above-mentioned post).
library(rvest)  # provides read_html()
library(plyr)   # provides rbind.fill(), used further below

j <- 1
MHP_codes <- c(150140:150180)  # therapist identifiers
df_list <- vector(mode = "list", length = length(MHP_codes))

for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NA)
  j <- j + 1
}
When the loop is done, I bind the information from the different profiles into one data frame and save it:
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
The function (getProfile) works well on individual profiles. It also works on a small range of profiles (e.g. c(150100:150150)). Please note that I do not know which psychotherapist IDs are actually assigned, so many URLs within the range do not exist.
However, generally speaking, tryCatch should handle this. When a URL does not exist (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
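For example, on a single unassigned ID I would expect a call like the following to return NA instead of raising an error (just a sketch: 150152 is an ID that appears to be unassigned, and getProfile is the function from the linked post):

test <- tryCatch(getProfile(read_html('https://www.psychologytoday.com/us/therapists/illinois/150152')),
                 error = function(e) NA)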
However, in some ID ranges, two problems arise.
First, I get an error message such as the following:
Error in open.connection(x, "rb") : HTTP error 404.
This happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).
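If I understand tryCatch correctly, it only catches errors raised inside the expression passed to it; in my loop, read_html() is called before tryCatch(), so an HTTP 404 raised there would stop the loop. I wonder whether a variant like the following would avoid that (an untested sketch, reusing MHP_codes, df_list and getProfile from above):

for (j in seq_along(MHP_codes)) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', MHP_codes[j])
  df_list[[j]] <- tryCatch({
    page <- read_html(URL)  # the HTTP error would now be raised inside tryCatch
    getProfile(page)
  },
  error = function(e) NA)
}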
Moreover, after the loop stops and R runs the line:
final_df <- rbind.fill(df_list)
a warning appears:
Warning message: In df[[var]] : closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)
It seems there is a specific problem with that one empty URL. In fact, when I change the ID range, the loop works well despite non-existent URLs: when a URL exists, the information is scraped from the website; when a URL does not exist, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.
Is it possible, perhaps, to tell R to skip a URL if it is empty, without recording anything? That would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it, nor whether it would actually solve my problem.
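For instance, if the failed attempts end up as NA in df_list, perhaps I could keep only the elements that are data frames before binding (a sketch, assuming getProfile returns a data frame on success):

df_list <- Filter(is.data.frame, df_list)  # drop NA entries left by failed URLs
final_df <- rbind.fill(df_list)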
Is anyone able to help me sort out this issue?