
I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use STATA; I use R only for very specific tasks). To learn scraping, I am practicing on a few pages on Psychology Today.

I have written a function that allows me to scrape information for one therapist and to create a data set with the information collected in this way:

install.packages('rvest') #Installing the rvest package
install.packages('xml2') #Installing the xml2 package
library('rvest') #to scrape
library('xml2')  #to handle missing values (it works with html_node, not with html_nodes)

#Specifying the url for desired website to be scraped
url <- 'https://www.psychologytoday.com/us/therapists/THE_ONE_YOU_WANT'

#Reading the HTML code from the website
URL <- read_html(url)

#creating the function
getProfile <- function(profilescrape) {

      ##NAME
            #Using CSS selectors to get the name
            nam_html <- html_node(URL,'.contact-name')
            #Converting the name data to text
            nam <- html_text(nam_html)
            #Let's have a look at the name
            head(nam)
            #Data-Preprocessing: removing '\n' (for the next pieces of information, I will keep \n
            #                                   to help me separate each item within the same type
            #                                   of information)
            nam<-gsub("\n","",nam)
            head(nam)
            #Converting each info from text to factor
            nam<-as.factor(nam)
            #Let's have a look at the name
            head(nam)


        ##MODALITIES
            #Using CSS selectors to get the modality
            mod_html <- html_node(URL,'.attributes-modality .copy-small')
            #Converting the modality data to text
            mod <- html_text(mod_html)
            #Let's have a look at the modalities
            head(mod)
            #Converting each info from text to factor
            mod<-as.factor(mod)
            #Let's have a look at the modalities
            head(mod)


        ##Combining all the lists to form a data frame
              onet_df<-data.frame(Name = nam,
                                  Modality = mod)

        ##Structure of the data frame
        str(onet_df)

            }

View(onet_df)

This code seems to work well for whatever therapist I choose. Now, I would like to use this function on multiple profiles to generate one data set with the name and modality of each MHP. Let's say that I want to apply the above function "getProfile" to the first 20 therapists in Illinois and put the information for these 20 therapists into a data set called "onet_df":

j <- 1
MHP_codes <-  c(324585 : 449807) #therapist identifier
withinpage_codes <-  c(1 : 20) #therapist running number
  for(code1 in withinpage_codes) {
    for(code2 in MHP_codes) {
      URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
      record_profile <- getProfile <- function(profilescrape)
      onet_df[[j]] <- rbind.fill(onet_df, record_profile)
      j <- j + 1
      }
}

EDIT:

This loop does not create any data set; moreover, it does not give any error message. Would someone be able to help me debug this loop? Please keep in mind that I am a real beginner.

Following the suggestions, I have modified the beginning of the function as follows:

#creating the function
getProfile <- function(URL) {....}

Moreover, I have used three alternative loops:

1st alternative

j <- 1
MHP_codes <-  c(324585 : 449807) #therapist identifier
withinpage_codes <-  c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    record_profile <- getProfile(URL)
      onet_df[[j]] <- rbind.fill(onet_df, record_profile)
    j <- j + 1
  }
}

which gives the following error message: Error in UseMethod("xml_find_first") : no applicable method for 'xml_find_first' applied to an object of class "character"

2nd alternative

MHP_codes <- c(324585, 449807)  #therapist identifier 
withinpage_codes <- c(1:20)     #therapist running number 

df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))

j <- 1
for(code1 in withinpage_codes) { 
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf') 
    df_list[[j]] <- getProfile(URL)
    j <- j + 1 
  } 
}

final_df <- rbind.fill(df_list)

This loop gives the same error message (please, refer to the above one).

Now, I just have to figure out why no data set is produced by the loop. There might be two problems: first, something within the loop does not work (I have run both loops on a single existing page and no data set is produced); second, when I run the loop over a series of links, some of them might be missing, which would produce an error message.
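The error message can be reproduced without any network access. The sketch below (assuming rvest is installed; the inline HTML string is just a stand-in for a real profile page) shows that html_node() needs a parsed document, so a URL that is still a plain character string when it reaches getProfile() produces exactly this class mismatch:

```r
library(rvest)  # read_html() is re-exported from xml2

html_string <- "<div class='contact-name'>Jane Doe</div>"

# html_node() needs a parsed document; handing it the raw character string
# is what triggers "no applicable method for 'xml_find_first' applied to an
# object of class 'character'". Parse with read_html() first.
doc <- read_html(html_string)                      # an xml_document
nam <- html_text(html_node(doc, ".contact-name"))  # "Jane Doe"
```

In the loops above, URL is still a character string when it reaches getProfile(), hence the error.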

Fuca26
  • For a so-called _real beginner_, excellent question. Wish there were more like this. – QHarr Sep 23 '19 at 05:36
  • thank you! I am just trying to replicate similar scraping projects already done by other people (I have several references, so I ended up not mentioning them here) – Fuca26 Sep 23 '19 at 18:44

2 Answers


Consider several adjustments:

  • Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function; the function takes whatever URL is assigned in the global environment.

    getProfile <- function(URL) { 
       ...
    }
    
  • Adjust the ending of the function to return the needed object. Without an explicit return(), R returns the value of the last evaluated expression. Therefore, replace str(onet_df) with return(onet_df).

  • Pass the dynamic URL in the loop to the function, instead of re-assigning the function definition:

    URL <- paste0(...) 
    record_profile <- getProfile(URL)
    
  • Initialize a list of a specified length (2 x 20) before the loop. Then, on each iteration, assign to the list by index rather than growing the object inside the loop, which is memory-inefficient.

    MHP_codes <- c(324585, 449807)  #therapist identifier 
    withinpage_codes <- c(1:20)     #therapist running number 
    
    df_list <- vector(mode = "list",
                      length = length(MHP_codes) * length(withinpage_codes))
    
    j <- 1
    for(code1 in withinpage_codes) { 
        for(code2 in MHP_codes) {
            URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf') 
            df_list[[j]] <- tryCatch(getProfile(URL), 
                                     error = function(e) NULL)
            j <- j + 1 
        } 
    }
    
  • Call rbind.fill once outside the loop to combine all data frames together:

    final_df <- rbind.fill(df_list)
    
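The return() point can be seen in isolation with two toy functions (hypothetical one-column data frames, not the scraped data):

```r
# str() prints a summary to the console but returns NULL invisibly,
# so a function ending in str(df) hands NULL back to its caller.
f_str    <- function() { df <- data.frame(Name = "A"); str(df) }

# Ending with return(df) (or simply df) hands the data frame back.
f_return <- function() { df <- data.frame(Name = "A"); return(df) }

out1 <- f_str()     # NULL
out2 <- f_return()  # a 1-row data frame
```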

With that said, consider an apply-family solution, specifically Map (a wrapper to mapply). Doing so, you avoid the bookkeeping of initializing the list and the incrementing variable, and you "hide" the loop in a compact statement.

# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = c(1:20))

# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) { 
   URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')

    # ...same code as before...
}

# ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2), 
                                               error = function(e) NULL),
               code1 = web_codes_df$withinpage_codes,
               code2 = web_codes_df$MHP_codes)

final_df <- rbind.fill(df_list)
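To make the expand.grid()/Map() mechanics concrete, here is a small self-contained sketch (example.com is a placeholder URL and the ranges are shortened):

```r
# expand.grid() builds one row per (MHP code, running number) pairing;
# the first column varies fastest across rows.
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = c(1:3))

# Map() walks the two columns element-wise, passing row i of each
# column to the function together.
urls <- Map(function(id, ref) paste0("https://example.com/", id, "?ref=", ref),
            web_codes_df$MHP_codes,
            web_codes_df$withinpage_codes)

length(urls)   # 2 * 3 = 6
urls[[1]]      # "https://example.com/324585?ref=1"
```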
Parfait
  • Thank you @Parfait for the suggestion. Neither version of the loop seems to be working though. – Fuca26 Sep 23 '19 at 20:37
  • Please relay the error or undesired results. *Not working* is not very descriptive. This solution assumes there is no issue with `getProfile()`. Possibly pages are not available where you need to wrap `tryCatch` or add `Sys.sleep` for processing to finish. – Parfait Sep 23 '19 at 20:51
  • I know, sorry for that...I have to understand what does not work first. I'll be back soon with more details :/ – Fuca26 Sep 23 '19 at 21:02
  • I am not sure that I can follow either your suggestions or understand R error messages (I mentioned earlier I am a debutant). When I run the content of the function on a therapist profile, it works well--I can see the data set which is composed of 15 variables on 1 observation. Thus, I assume that the content of the function works. – Fuca26 Sep 23 '19 at 21:30
  • If I run the function on one profile, it also works: url <- 'https://www.psychologytoday.com/us/therapists/illinois/324585?sid=5d87fb397b155&ref=1&tr=ResultsName' /// #Reading the HTML code from the website /// URL <- read_html(url) /// onet_df <- getProfile(URL) /// #load data set /// load("onet_df.Rda") #the data set is the same as above (1 observation and 15 variables) – Fuca26 Sep 23 '19 at 21:40
  • Please edit your post with entire R error message. We can help decipher. My guess which usually happens with web scraping many pages is that *one* page does not work, breaking everything. I have edited to include `tryCatch` which will suppress errors and return `NULL` for those problem pages. Review the `df_list` object to see which item did not process. – Parfait Sep 23 '19 at 21:42
  • Parfait, I have solved the issue with the link. But still none of the suggested solutions work. Have you tried to run the script you suggested? Maybe it is easier to understand what is going on by comparing our results. – Fuca26 Sep 24 '19 at 16:08
  • Once again, *none of the suggested solutions work* is not helpful. Please describe what actually happens. Any errors? Is `df_list` not generated? Your post indicates an error in the web scraping, not the loop, which I encountered on my end as well. Change `NULL` to `print(e)`. Also, I just added a point regarding the return of an R function. Right now your function returns its last line: `str(onet_df)`. Adjust to `return(onet_df)`. – Parfait Sep 24 '19 at 19:30
  • Btw - all the console output including `head` and `str` calls will not run inside a function unless you use `print`. Try even printing *code1* and *code2* inside the `for` loop, or the function if using `Map`, to see which is the problematic URL. – Parfait Sep 24 '19 at 19:31
  • thanks for the multiple suggestions! Now the code works! I will post it below as an answer. – Fuca26 Sep 24 '19 at 20:37

One of the users, Parfait, helped me sort out the issues. So, a very big thank you goes to this user. Below I post the script. I apologize if it is not precisely commented.

Here is the code.

#Loading packages
library('rvest') #to scrape
library('xml2')  #to handle missing values (it works with html_node, not with html_nodes)
library('plyr')  #to bind together different data sets

#get working directory
getwd()
setwd("~/YOUR OWN FOLDER HERE")

#DEFINE SCRAPING FUNCTION
getProfile <- function(URL) {


          ##NAME
                #Using CSS selectors to get the name
                nam_html <- html_node(URL,'.contact-name')
                #Converting the name data to text
                nam <- html_text(nam_html)
                #Let's have a look at the name
                head(nam)
                #Data-Preprocessing: removing '\n' (for the next pieces of information, I will keep \n
                #                                   to help me separate each item within the same type
                #                                   of information)
                nam<-gsub("\n","",nam)
                head(nam)
                #Converting each info from text to factor
                nam<-as.factor(nam)
                #Let's have a look at the name
                head(nam)
                #If I need to remove blank space do this:
                  #Data-Preprocessing: removing excess spaces
                  #variable<-gsub(" ","",variable)


            ##MODALITIES
                #Using CSS selectors to get the modality
                mod_html <- html_node(URL,'.attributes-modality .copy-small')
                #Converting the modality data to text
                mod <- html_text(mod_html)
                #Let's have a look at the modalities
                head(mod)
                #Converting each info from text to factor
                mod<-as.factor(mod)
                #Let's have a look at the modalities
                head(mod)

                ##Combining all the lists to form a data frame
                onet_df<-data.frame(Name = nam,                                                                                     
                                    Modality = mod)

                return(onet_df)
}

Then, I apply this function with a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens if the loop stumbles on a non-existent link).

j <- 1
MHP_codes <-  c(163805:163808) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
  for(code1 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
    #Reading the HTML code from the website
    URL <- read_html(URL)
    df_list[[j]] <- tryCatch(getProfile(URL), 
                             error = function(e) NULL)
    j <- j + 1
  }

final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
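One useful check after the loop: profiles that failed inside tryCatch() end up as NULL entries in df_list (rbind.fill() drops them), so the failures can be located by index. A sketch with dummy stand-ins for the scraped profiles:

```r
# Stand-ins for scraped profiles; the second one "failed" and is NULL
df_list <- list(data.frame(Name = "A"), NULL, data.frame(Name = "B"))

# Indices of profiles that did not scrape, e.g. to re-check their URLs
failed <- which(vapply(df_list, is.null, logical(1)))
failed  # 2
```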
Fuca26