Identify a weblink in bold in R

Question

The following script takes me to a web page that lists several links with similar names. I want only one of them, which can be distinguished from the others because it is displayed in bold on the page. However, I could not find a way to select a bold link from the list.

Does anyone have a hint? Thanks in advance!

library(httr)
library(rvest)
sp="Alnus japonica"

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = unlist(strsplit(as.character(sp), split=" "))[1], 
                          yearPublished ="", 
                          species = unlist(strsplit(as.character(sp), split=" "))[2], 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 
pg <- content(res, as="parsed") 
lnks <- html_attr(html_nodes(pg,"a"),"href")
# How do I get the URL of the link with the accepted name (in bold)?
res2 <- try(GET(sprintf("http://apps.kew.org%s", lnks[grep("id=", lnks)][1])), silent=TRUE)
# This gets a link, but often not the bold one

It depends a lot on how it was made bold. If it's inline styling, that's pretty easy, but it's probably CSS applied to a particular id or class, which means digging through the code. — alistaire, May 05 '16 at 23:37
If you search manually, you actually do get a `<b>` tag, but it doesn't seem to show up in the `httr` results, so it must be inserted after the fact somehow. — alistaire, May 05 '16 at 23:52
The links are surrounded by `<b>` tags, so you should be able to get them that way. Like alistaire said, not sure why `httr` is deleting them (I've no experience with `httr`, there may be an option...) — MichaelChirico, May 05 '16 at 23:52
`libxml2` (which powers `rvest` & `XML`) is not as flexible as a browser. `<b>` outside a `<p>` is technically invalid HTML/XML and `libxml2` parses it that way. — hrbrmstr, May 06 '16 at 01:52

hrbrmstr · Answer 1 · 2016-05-06T10:23:13.170

First, grab tidy-html5 (it works on pretty much everything), install it, and make sure it's on your PATH.

As my comment said, browsers handle <b> outside <p> because they need to be bulletproof; libxml2 does not. So we need to clean the HTML up first (and I now need to make a new tidyhtml package) and then process the tidied version:

library(xml2)
library(httr)
library(rvest)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tf <- tempfile(fileext=".html")
cat(content(res, as="text"), file=tf)

tidy <- system2("tidy", c("-q", tf), TRUE)

pg <- read_html(paste(tidy, collapse="\n"))

html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]")

## {xml_nodeset (1)}
## [1] <a href="/wcsp/namedetail.do?name_id=6471" class="onwa ...

If CSS selectors are desired over XPath:

html_nodes(pg, "p > b > a[href*='name_id']")
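
To go from the matched node to a URL you can request, pass the attribute name to html_attr() (calling it without "href" fails with an 'argument "name" is missing' error). A minimal offline sketch: the `sample_html` string is a made-up stand-in for the tidied page, and `bold_link`, `href` and `accepted_url` are names chosen here for illustration, not part of the original script.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the tidied search results:
# one bold (accepted) name and one plain link
sample_html <- '<html><body>
<p><b><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav">Alnus japonica</a></b></p>
<p><a href="/wcsp/namedetail.do?name_id=1234" class="onwardnav">Alnus japonica var. x</a></p>
</body></html>'

pg <- read_html(sample_html)

# Select only the link wrapped in <b>
bold_link <- html_nodes(pg, "p > b > a[href*='name_id']")

# html_attr() needs the attribute name as its second argument
href <- html_attr(bold_link, "href")

# Build the absolute URL; this could then be passed to httr::GET()
accepted_url <- paste0("http://apps.kew.org", href)
```

If a species returns no bold link, `bold_link` has length zero, so it is worth checking `length(bold_link)` before requesting `accepted_url`.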

UPDATE

I started a basic pkg wrapper for libtidy. If you're on OS X and use Homebrew you can do: brew install tidy-html5 (which installs the binary above and the libtidy library) and devtools::install_github("hrbrmstr/tidyhtml") to install the pkg. Then, it's just:

library(xml2)
library(httr)
library(rvest)
library(htmltidy)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tidy_html <- tidy(content(res, as="text"))

pg <- read_html(tidy_html)

html_nodes(pg, "p > b > a[href*='name_id']")

I should be able to get this working on Windows & Linux and make it a real package (right now it's a thin wrapper with no error checking), but that'll be low on the TODO list for a while.

wow, awesome! it would've taken me a week to figure this out. — MichaelChirico, May 06 '16 at 14:32
Thank you very much! I am trying to install tidy on 64-bit Windows from GitHub using cmake, but it's not so easy... a pointer to a good tutorial would be appreciated. — Agus camacho, May 07 '16 at 06:35
They have [binaries](https://github.com/htacg/tidy-html5/releases/tag/5.2.0) for Windows. — hrbrmstr, May 07 '16 at 11:46
Thanks, that now works for the example, but not for other species, like Abies amabilis. Although it is a valid name, I got this error: `lnks <- html_attr(html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]"))` gives `Error in node_attr(x$node, name = attr, missing = default, nsMap = ns) : argument "name" is missing, with no default`. Should I use a list of potential identifiers? — Agus camacho, May 09 '16 at 14:32
The package now compiles on Windows and is on CRAN but until CRAN is up to 0.3.0 (I found a nasty bug right after the CRAN submission) it's best to use the github/dev version. — hrbrmstr, Sep 11 '16 at 13:14

score 1 · Answer 2 · answered May 06 '16 at 00:24

Seems to me like there might be a bug with rvest/httr here, as <b> appears to surround <a href...> on the relevant link, but not in the parsed version.

I used:

library(rvest)
sp=strsplit("Alnus japonica", " ")[[1]]

session <- html_session("http://apps.kew.org/wcsp/advsearch.do")
form <- html_form(session)[[1]]

filled_form <- set_values(form, genus = sp[1], species = sp[2])

out <- submit_form(session, filled_form)

Look at the following:

out %>% html_nodes(xpath = "descendant-or-self::*") %>% `[`(81:90)
# {xml_nodeset (10)}
#  [1] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [2] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [3] <i>Alnus</i>
#  [4] <i> japonica</i>
#  [5] <b>\n        </b>
#  [6] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [7] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [8] <i>Alnus</i>
#  [9] <i> japonica</i>
# [10] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...

As you can see, the <b> node appears empty. However, when I enter the search manually and View Source on Chrome, I see:

<b>
    <p><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav"><i>Alnus</i><i> japonica</i> (Thunb.) Steud., Nomencl. Bot., ed. 2, 1: 55 (1840).</a>
    </p>
</b>

That the relevant <a> is between <b> and </b> tells me it should be a child of that <b>, but this comes up blank:

out %>% html_nodes(xpath = "//b/child::*")

I'm admittedly no xpath expert, so I could be mucking things up here. Hope this helps get you on your way.
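
For what it's worth, the XPath expression itself seems fine. A minimal sketch, assuming a hand-written and properly nested snippet (the `snippet` markup below is invented for illustration and is not the Kew page), in which <b> really is the parent of the link in the parsed tree:

```r
library(xml2)

# Hypothetical, well-formed markup: <b> sits inside <p>,
# so the parser has no reason to rearrange the nesting
snippet <- '<html><body><p><b><a href="/wcsp/namedetail.do?name_id=6471">Alnus japonica</a></b></p></body></html>'
doc <- read_html(snippet)

# Same expression as above: every element child of every <b>
kids <- xml_find_all(doc, "//b/child::*")

xml_name(kids)          # element name of each match
xml_attr(kids, "href")  # link target of each match
```

Here the expression finds the <a> element, which suggests the blank result above comes from how the page's markup is parsed rather than from the XPath.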
