
The following script takes me to a web page that lists several links with similar names. I want only one of them, which can be differentiated from the others because it is shown in bold on the page. However, I could not find a way to select a bold link within a list.

Would anyone have a hint on this? Thanks in advance!

library(httr)
library(rvest)
sp="Alnus japonica"

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = unlist(strsplit(as.character(sp), split=" "))[1], 
                          yearPublished ="", 
                          species = unlist(strsplit(as.character(sp), split=" "))[2], 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 
pg <- content(res, as="parsed") 
lnks <- html_attr(html_nodes(pg,"a"),"href")
# How do I get the URL of the link with the accepted name (shown in bold)?
res2 <- try(GET(sprintf("http://apps.kew.org%s", lnks[grep("id=", lnks)][1])), silent = TRUE)
# This gets a link, but often not the bold one.
hrbrmstr
Agus camacho
  • It depends a lot on how it was made bold. If it's inline styling, that's pretty easy, but it's probably CSS applied to a particular id or class, which means digging through the code. – alistaire May 05 '16 at 23:37
  • If you search manually, you actually do get a `<b>` tag, but it doesn't seem to show up in the `httr` results, so it must be inserted after the fact somehow. – alistaire May 05 '16 at 23:52
  • The links are surrounded by `<b>` tags, so you should be able to get them that way. Like alistaire said, not sure why `httr` is deleting them (I've no experience with `httr`, there may be an option...) – MichaelChirico May 05 '16 at 23:52
  • `libxml2` (which powers `rvest` & `XML`) is not as flexible as a browser. `<b>` outside a `<p>` is technically invalid HTML/XML and `libxml2` parses it that way. – hrbrmstr May 06 '16 at 01:52

2 Answers


First, grab tidy-html5 (it works on pretty much everything) and install it and ensure it's in your PATH.

As my comment said, browsers handle <b> outside <p> as they need to be bulletproof. libxml2 does not. So, we need to clean this up first (and I now need to make a new tidyhtml package) and then process the tidied version:

library(xml2)
library(httr)
library(rvest)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tf <- tempfile(fileext=".html")
cat(content(res, as="text"), file=tf)

tidy <- system2("tidy", c("-q", tf), TRUE)

# tidy returns one string per output line; rejoin with newlines before parsing
pg <- read_html(paste(tidy, collapse="\n"))

html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]")

## {xml_nodeset (1)}
## [1] <a href="/wcsp/namedetail.do?name_id=6471" class="onwa ...

If CSS selectors are desired over XPath:

html_nodes(pg, "p > b > a[href*='name_id']")
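To see what this selector distinguishes without querying the server, here is a minimal sketch on an inline snippet. The markup is an assumption: it mimics the `p > b > a` nesting that the selectors above expect from the tidied output, and the second `name_id` value is invented purely for illustration.

```r
library(rvest)

# Hypothetical tidied markup: one bold (accepted) link and one plain link.
# name_id=1234 is an invented placeholder, not a real WCSP record.
snippet <- paste0(
  "<html><body>",
  "<p><b><a href='/wcsp/namedetail.do?name_id=6471'>Alnus japonica</a></b></p>",
  "<p><a href='/wcsp/namedetail.do?name_id=1234'>Alnus japonica var.</a></p>",
  "</body></html>"
)
demo_pg <- read_html(snippet)

# Every link carrying a name_id, bold or not
all_links <- html_nodes(demo_pg, "a[href*='name_id']")

# Only links whose parent is a <b> that is itself a child of a <p>
bold_links <- html_nodes(demo_pg, "p > b > a[href*='name_id']")

length(all_links)
length(bold_links)
html_attr(bold_links, "href")
```

The child combinator (`>`) is what restricts the match to the bold link; a plain `a[href*='name_id']` would return both.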

UPDATE

I started a basic pkg wrapper for libtidy. If you're on OS X and use Homebrew you can do: brew install tidy-html5 (which installs the binary above and the libtidy library) and devtools::install_github("hrbrmstr/tidyhtml") to install the pkg. Then, it's just:

library(xml2)
library(httr)
library(rvest)
library(htmltidy)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tidy_html <- tidy(content(res, as="text"))

pg <- read_html(tidy_html)

html_nodes(pg, "p > b > a[href*='name_id']")

I should be able to get this to work on Windows & linux and make it a real package (it's a thin wrapper w/o error checking now) but that'll be down on the TODO for a while.
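For the original goal of fetching the page behind the bold link, something along these lines may help. It is only a sketch: `accepted_name_url` is a hypothetical helper (not part of `rvest` or the tidy wrapper), it assumes the tidied page has the `p > b > a` nesting used above, and it is demonstrated on inline HTML rather than the live site.

```r
library(rvest)

# Hypothetical helper: returns the absolute URL of the first bold
# "name_id" link, or NA if the selector matches nothing (for example,
# when a species has no bold entry in the results).
accepted_name_url <- function(pg, base = "http://apps.kew.org") {
  nodes <- html_nodes(pg, "p > b > a[href*='name_id']")
  if (length(nodes) == 0) return(NA_character_)
  # html_attr() needs the attribute name as its second argument
  paste0(base, html_attr(nodes, "href")[1])
}

# Inline stand-ins for a tidied result page and a page with no bold link
hit  <- read_html("<p><b><a href='/wcsp/namedetail.do?name_id=6471'>Alnus japonica</a></b></p>")
miss <- read_html("<p><a href='/wcsp/namedetail.do?name_id=6471'>Alnus japonica</a></p>")

url_hit  <- accepted_name_url(hit)
url_miss <- accepted_name_url(miss)
```

Checking for `NA` before calling `GET()` avoids requesting a malformed URL when no bold link is found.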

hrbrmstr
  • wow, awesome! it would've taken me a week to figure this out. – MichaelChirico May 06 '16 at 14:32
  • Thank you very much! I am trying to install tidy on 64-bit Windows from GitHub using cmake, but it is not so easy. Any good tutorial would be appreciated. – Agus camacho May 07 '16 at 06:35
  • They have [binaries](https://github.com/htacg/tidy-html5/releases/tag/5.2.0) for Windows. – hrbrmstr May 07 '16 at 11:46
  • Thanks, that worked for the example, but not for other species, like Abies amabilis. Although it is a valid name, I got an error from `lnks <- html_attr(html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]"))`: `Error in node_attr(x$node, name = attr, missing = default, nsMap = ns) : argument "name" is missing, with no default`. Should I use a list of potential identifiers? – Agus camacho May 09 '16 at 14:32
  • you forgot `, "href"` before the last `)` – hrbrmstr May 09 '16 at 14:56
  • The package now compiles on Windows and is on CRAN but until CRAN is up to 0.3.0 (I found a nasty bug right after the CRAN submission) it's best to use the github/dev version. – hrbrmstr Sep 11 '16 at 13:14

Seems to me like there might be a bug with rvest/httr here, as <b> appears to surround <a href...> on the relevant link, but not in the parsed version.

I used:

library(rvest)
sp=strsplit("Alnus japonica", " ")[[1]]

session <- html_session("http://apps.kew.org/wcsp/advsearch.do")
form <- html_form(session)[[1]]

filled_form <- set_values(form, genus = sp[1], species = sp[2])

out <- submit_form(session, filled_form)

Look at the following:

out %>% html_nodes(xpath = "descendant-or-self::*") %>% `[`(81:90)
# {xml_nodeset (10)}
#  [1] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [2] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [3] <i>Alnus</i>
#  [4] <i> japonica</i>
#  [5] <b>\n        </b>
#  [6] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [7] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [8] <i>Alnus</i>
#  [9] <i> japonica</i>
# [10] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...

As you can see, the <b> node appears empty. However, when I enter the search manually and View Source on Chrome, I see:

<b>
    <p><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav"><i>Alnus</i><i> japonica</i> (Thunb.) Steud., Nomencl. Bot., ed. 2, 1: 55 (1840).</a>
    </p>
</b>

That the relevant <a> is between <b> and </b> tells me it should be a child of that <b>, but this comes up blank:

out %>% html_nodes(xpath = "//b/child::*")

I'm admittedly no xpath expert, so I could be mucking things up here. Hope this helps get you on your way.
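One possible workaround without tidying, sketched on a hand-written snippet: the nodeset printed above shows an empty `<b>` immediately before the `<p>` that holds a link, so the link might be reachable as the first `<p>` sibling following that `<b>`. This assumes `<b>` and `<p>` really are siblings in the parsed tree, which the printout alone does not confirm; the snippet and its `name_id` values are invented for illustration.

```r
library(rvest)

# Hand-written stand-in for the parsed tree: an empty <b> followed by <p>s.
# The name_id values are placeholders, not real WCSP records.
snippet <- paste0(
  "<html><body><div>",
  "<p><a href='/wcsp/namedetail.do?name_id=1'>first</a></p>",
  "<b>\n</b>",
  "<p><a href='/wcsp/namedetail.do?name_id=2'>second</a></p>",
  "<p><a href='/wcsp/namedetail.do?name_id=3'>third</a></p>",
  "</div></body></html>"
)
demo_pg <- read_html(snippet)

# <b> has no element children here, so this matches nothing
b_children <- html_nodes(demo_pg, xpath = "//b/child::*")

# Instead, take the first <p> following the empty <b> and read its link
after_b <- html_nodes(demo_pg, xpath = "//b/following-sibling::p[1]/a")
href <- html_attr(after_b, "href")
```

If the real page nests these elements differently, the `following-sibling` axis would need adjusting, so it is worth inspecting the tree around `<b>` first.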

MichaelChirico