Dear Stack Overflow users,
I am using R to scrape the profiles of a few psychotherapists from Psychology Today; this is an exercise to learn more about web scraping.
I am new to R and I have to go through this intense training that will help me with future projects. This means that I might not know precisely what I am doing at the moment (e.g. I might not interpret the script or R's error messages correctly), but I have to get it done. I therefore beg your pardon for possible misunderstandings or inaccuracies.
In short, the situation is the following. I have created a function that scrapes information from 2 nodes of psychotherapists' profiles; the function is shown in this Stack Overflow post.
Then I created a loop that applies the function to a few psychotherapists' profiles; the loop is in the above post as well, but I report it below because it is the part of the script that generates problems (in addition to those I solved in the above-mentioned post).
library(rvest)  # provides read_html()
library(plyr)   # provides rbind.fill(), used further below

j <- 1
MHP_codes <- c(150140:150180)  # therapist identifiers
df_list <- vector(mode = "list", length = length(MHP_codes))

for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NA)
  j <- j + 1
}
When the loop is done, I bind the information from the different profiles into one data frame and save it:
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
The function (getProfile) works well on individual profiles. It also works on a small range of profiles (e.g. c(150100:150150)). Please note that I do not know which psychotherapist IDs are actually assigned, so many URLs within the range do not exist.
However, generally speaking, tryCatch should handle this. When a URL does not exist (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
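For example, on a single unassigned ID I would expect a call like the following to return NA instead of raising an error (just a sketch: 150152 is an ID that appears to be unassigned, and getProfile is the function from the linked post):

test <- tryCatch(getProfile(read_html('https://www.psychologytoday.com/us/therapists/illinois/150152')),
                 error = function(e) NA)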
However, in some ID ranges, two problems arise.
First, I get an error message such as the following:
Error in open.connection(x, "rb") : HTTP error 404.
This happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).
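If I understand tryCatch correctly, it only catches errors raised inside the expression passed to it; in my loop, read_html() is called before tryCatch(), so an HTTP 404 raised there would stop the loop. I wonder whether a variant like the following would avoid that (an untested sketch, reusing MHP_codes, df_list and getProfile from above):

for (j in seq_along(MHP_codes)) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', MHP_codes[j])
  df_list[[j]] <- tryCatch({
    page <- read_html(URL)  # the HTTP error would now be raised inside tryCatch
    getProfile(page)
  },
  error = function(e) NA)
}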
Moreover, after the loop stops and R runs the line:
final_df <- rbind.fill(df_list)
a warning appears:
Warning message: In df[[var]] : closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)
It seems there is a specific problem with that one empty URL. In fact, when I change the ID range, the loop works well despite non-existent URLs: when a URL exists, the information is scraped from the website; when a URL does not exist, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.
Is it possible, perhaps, to tell R to skip a URL if it is empty, without recording anything? That would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it, nor whether it would actually solve my problem.
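For instance, if the failed attempts end up as NA in df_list, perhaps I could keep only the elements that are data frames before binding (a sketch, assuming getProfile returns a data frame on success):

df_list <- Filter(is.data.frame, df_list)  # drop NA entries left by failed URLs
final_df <- rbind.fill(df_list)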
Is anyone able to help me sort out this issue?