
Building on an answer to a former question of mine, I'm scraping this website for links with the RSelenium package, using the following code:

library(RSelenium)

# Start the Selenium server and open a Chrome session
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444,
                      browserName = "chrome")

remDr$open(silent = TRUE)
remDr$navigate("http://karakterstatistik.stads.ku.dk/")
Sys.sleep(2)

# Submit the search form
webElem <- remDr$findElement("name", "submit")
webElem$clickElement()
Sys.sleep(5)

# Save the page source of the first 100 result pages, clicking "next" between pages
html_source <- vector("list", 100)
i <- 1
while (i <= 100) {
  html_source[[i]] <- remDr$getPageSource()
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
  i <- i + 1
}
Sys.sleep(3)
remDr$close()

When I try to scrape the list of page sources created above (html_source) with the rvest package, I get an error because the source is not an HTML file:

kar.links = html_source %>% 
  read_html(encoding = "UTF-8") %>% 
  html_nodes("#searchResults a") %>% 
  html_attr("href")

I've tried collapsing the list and looking for a string-to-HTML converter, but nothing seems to work. I suspect the solution lies in how I save the page sources in the loop.

EDIT: I fixed it with this less-than-beautiful solution:

links <- vector("list", 100)
i <- 1
while (i <= 100) {
  links[[i]] <- html_source[[i]][[1]] %>% 
    read_html(encoding = "UTF-8") %>% 
    html_nodes("#searchResults a") %>% 
    html_attr("href") 
  i <- i + 1
}
col_links <- links %>% 
  unlist()
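
For reference, the same per-page loop can probably be written more compactly with lapply (untested sketch, same selector as above):

library(rvest)

col_links <- html_source %>%
  lapply(function(src) {
    # src is a one-element list; src[[1]] is the raw HTML string
    src[[1]] %>%
      read_html(encoding = "UTF-8") %>%
      html_nodes("#searchResults a") %>%
      html_attr("href")
  }) %>%
  unlist()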

1 Answer


html_source is a nested list:

str(head(html_source, 3))
# List of 3
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__

In your case, html_source has 100 elements; each element is itself a list with one element, a string containing the raw HTML. Therefore, to get each raw HTML page, you need to access html_source[[1]][[1]], html_source[[2]][[1]], and so on.
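For example, to pull the links from the first page only (a sketch reusing the rvest pipeline and #searchResults selector from your question):

library(rvest)

html_source[[1]][[1]] %>% 
  read_html(encoding = "UTF-8") %>% 
  html_nodes("#searchResults a") %>% 
  html_attr("href")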

To flatten html_source, you can do: lapply(html_source, `[[`, 1). We get the same result if we use remDr$getPageSource()[[1]] in the while loop:

str(head(html_source, 3))
# List of 3
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
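
In other words, the only change needed in your original loop is to store the string rather than the one-element list (sketch):

while (i <= 100) {
  # getPageSource() returns a one-element list; keep only the HTML string
  html_source[[i]] <- remDr$getPageSource()[[1]]
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
  i <- i + 1
}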
  • Thanks again. The html_source[[1]][[1]] method works like a charm, so I am able to extract the links from each object in the list. The lapply() function works as well. So, to get the links from each of the 100 objects, do I have to loop the rvest functions, or is there another way to do this in one simple step? – ScrapeGoat Aug 17 '16 at 13:32
  • You could collapse the list using `unlist`, then use `paste(..., collapse = "")` to get one (very) long character string, that you can then parse for links. – Weihuang Wong Aug 17 '16 at 13:33
  • Using the following code, with and without the lapply function, only produces a vector of 179 links for some reason? flat <- lapply(html_source, `[[`, 1) %>% unlist() %>% paste(collapse = '') links <- flat %>% read_html(encoding = "UTF-8") %>% html_nodes("#searchResults a") %>% html_attr("href") – ScrapeGoat Aug 17 '16 at 13:58
  • Found a solution, I have edited it into the question – ScrapeGoat Aug 17 '16 at 15:24