When I use XPath 1.0's substring-before or substring-after in an expression, my subsequent xmlValue call throws an error. The code below shows that the XPath expression works fine with httr, but then doesn't work with RCurl.

require(XML)
require(httr)
doc <- htmlTreeParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp", useInternalNodes = TRUE)
(string <- xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')", xmlValue, trim = TRUE))


require(RCurl)
fetch <- GET("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
contents <- content(fetch)
locsnodes <- getNodeSet(contents, "//div[@id = 'contactInformation']//p")  
sapply(locsnodes, xmlValue)

[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440\r\n Toll Free: 866-721-6665\r\n Fax: 432-682-3672"

The code above works, but I want to use substring-before to trim the result down to this:

[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

locsnodes <- getNodeSet(contents, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")  
sapply(locsnodes, xmlValue)

Error in UseMethod("xmlValue") : 
  no applicable method for 'xmlValue' applied to an object of class "character"

How can I use substring-before or substring-after together with RCurl? RCurl is the package chosen for a more complicated operation used later.

Thank you for any guidance (or a better way to achieve what I want).

lawyeR
  • You could just do an `xpathSApply(contents, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')", xmlValue, trim = TRUE)` – hrbrmstr Oct 05 '14 at 13:12
  • The function call is redundant here so `doc["substring-before(//div[@id = 'contactInformation']//p, 'Phone')"]` would do the trick. – jdharrison Oct 05 '14 at 13:20
  • You don't use httr anywhere? – hadley Oct 09 '14 at 11:43
  • @hadley. I am actually scraping 600+ sites and I use httr in the function that runs through all the steps for downloading the html, parsing it and then running the XPath expression. I realize now, belatedly, that I should have explained that I need to avoid R regex solutions because that would require lots of one-off coding. If I can do it in XPath, I am far better off. – lawyeR Oct 09 '14 at 15:23

2 Answers

The fun argument of xpathSApply (and indeed of getNodeSet) is only applied when the XPath expression returns a node set. In your case the expression returns a character string, so the function is ignored:

require(XML)
require(RCurl)
doc <- htmlParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
locsnodes <- getNodeSet(doc,
                        "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")
> locsnodes
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

> str(locsnodes)
 chr "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

The fun argument is not used here in xpathSApply either:

> xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
+             , function(x){1}
+ )
[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

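This is because your XPath expression returns a string, not a node set. The result is already a character vector, so use it directly instead of passing it to xmlValue.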
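
To make the difference concrete, here is a minimal offline sketch. The inline `page` string is a hypothetical stand-in for the live site (its text is abbreviated from the output shown in the question), so treat it as an illustration rather than a reproduction of the real page. It contrasts a location path, which yields nodes that need xmlValue, with a string function, which yields text that needs nothing further.

```r
library(XML)

# Hypothetical inline page standing in for the live site, so this runs offline.
page <- paste0(
  "<html><body><div id='contactInformation'>",
  "<p>500 West Illinois, Suite 300\n Midland, Texas 79701\n Phone: 432-897-1440</p>",
  "</div></body></html>"
)
doc <- htmlParse(page, asText = TRUE)

# A location path returns a node set, so xmlValue is applied to each node.
nodes <- getNodeSet(doc, "//div[@id = 'contactInformation']//p")
full <- sapply(nodes, xmlValue)

# An XPath string function returns a character vector; no xmlValue needed.
address <- getNodeSet(
  doc,
  "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
)
address <- trimws(address)  # drop the trailing line break and space
```

One caveat worth keeping in mind: in XPath 1.0, a string function given a node set uses only the first node in document order, so this form returns a single string even when several paragraphs match.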

jdharrison

Here's a slightly different approach using the rvest package. I think you're generally better off doing string manipulation in R, rather than in XPath:

library(rvest)

# read_html() replaces html(), which was deprecated in later rvest releases
contact <- read_html("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")

contact %>%
  html_node("#contactInformation p") %>%
  html_text() %>%
  gsub(" Phone.*", "", .)
#> [1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n"
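
If the goal is to avoid a one-off regex per site, one possible alternative is a small base R helper that mimics XPath's substring-before with a literal marker. The name `substring_before` is hypothetical (it is not part of rvest or XML), and this is only a sketch of the idea:

```r
# Hypothetical helper: text before the first literal occurrence of `marker`,
# or "" when the marker is absent (mirroring XPath 1.0 substring-before).
substring_before <- function(x, marker) {
  pos <- as.integer(regexpr(marker, x, fixed = TRUE))  # -1 when not found
  out <- substr(x, 1, pos - 1)
  out[pos < 1] <- ""
  out
}

# Sample text copied from the output shown in the question.
txt <- "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440"
address <- trimws(substring_before(txt, "Phone"))
```

Because the helper is vectorised, it can be applied to the text of every matched node at once, whereas the XPath function only considers the first node of a node set.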
hadley