Building on an answer to a previous question of mine, I'm scraping this website for links with the RSelenium package, using the following code:
library(RSelenium)

startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444,
                      browserName = "chrome")
remDr$open(silent = TRUE)

remDr$navigate("http://karakterstatistik.stads.ku.dk/")
Sys.sleep(2)

# Submit the search form
webElem <- remDr$findElement("name", "submit")
webElem$clickElement()
Sys.sleep(5)

# Save the page source of each of the 100 result pages
html_source <- vector("list", 100)
i <- 1
while (i <= 100) {
  html_source[[i]] <- remDr$getPageSource()
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
  i <- i + 1
}
Sys.sleep(3)
remDr$close()
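As an aside, the fixed Sys.sleep() waits and the hard-coded 100 iterations are fragile: if the "next" button is ever missing (for example on the last results page), findElement() throws an error. A sketch of a more defensive loop body, assuming the same element id as above:

# Stop paging gracefully when the "next" button is no longer found
webElem <- tryCatch(remDr$findElement("id", "next"),
                    error = function(e) NULL)
if (is.null(webElem)) break  # no "next" button left: last page reached
webElem$clickElement()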
When I try to extract the links from the list of page sources (html_source) created above with the rvest package, I get an error, because read_html() is being handed the whole list rather than a single HTML string:
library(rvest)

kar.links <- html_source %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes("#searchResults a") %>%
  html_attr("href")
I've tried collapsing the list and looked for a string-to-HTML converter, but nothing seems to work. I suspect the solution lies in how I save the page sources in the loop.
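For what it's worth, storing the HTML string itself in the loop would sidestep the issue entirely, since the page source sits in the first element of the list that getPageSource() returns. A sketch of the one changed line inside the while loop above:

# Keep the character string, not the wrapping one-element list
html_source[[i]] <- remDr$getPageSource()[[1]]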
EDIT: I fixed it with this less-than-beautiful solution:
links <- vector("list", 100)
i <- 1
while (i <= 100) {
  # [[1]] unwraps the one-element list returned by getPageSource()
  links[[i]] <- html_source[[i]][[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("#searchResults a") %>%
    html_attr("href")
  i <- i + 1
}
col_links <- links %>%
  unlist()
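A tidier equivalent is to map over the stored sources instead of indexing by hand. A sketch using base lapply, with the same selectors as above:

# One pass over all saved page sources, then flatten to a character vector
col_links <- unlist(lapply(html_source, function(src) {
  read_html(src[[1]], encoding = "UTF-8") %>%
    html_nodes("#searchResults a") %>%
    html_attr("href")
}))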