
I need to understand how to apply substring-before() or substring-after() to multiple nodes.

The code immediately below returns not just the city I want but additional unwanted details.

require(XML)
require(httr)
doc <- htmlTreeParse("http://www.cpmy.com/contact.asp", useInternal = TRUE)

> (string <- xpathSApply(doc, "//div[@id = 'leftcol']//p", xmlValue, trim = TRUE))
[1] "Philadelphia Office1880 JFK Boulevard10th FloorPhiladelphia, PA 19103Tel: 215-587-1600Fax: 215-587-1699Map and Directions"           
[2] "Westmont Office216 Haddon AvenueSentry Office Plaza, Suite 703Westmont, NJ 08108Tel: 856-946-0400Fax: 856-946-0399Map and Directions"
[3] "Boston Office50 Congress StreetSuite 430Boston, MA 02109Tel: 617-854-8315Fax: 617-854-8311Map and Directions"                        
[4] "New York Office5 Penn Plaza23rd FloorNew York, NY 10001Tel: 646-378-2192Fax: 646-378-2001Map and Directions" 

I added substring-before(), but it returns only the first element (correctly shortened) and drops the remaining three:

> (string <- xpathSApply(doc, "substring-before(//div[@id = 'leftcol']//p, 'Office')", xmlValue, trim = TRUE))
[1] "Philadelphia "

How should I revise my XPath expression so that it returns all four elements, each shortened to the text before "Office"?

Thank you.

lawyeR
  • How is this different from your [previous question](http://stackoverflow.com/questions/26202615/why-different-results-with-xpath-1-0-and-rcurl-vs-httr-using-substring-before-a)? I think you should make more of an effort on your own before asking here. – agstudy Oct 05 '14 at 20:21
  • I would also suggest that you take the effort to read the answers given to a number of your previous questions. – jdharrison Oct 05 '14 at 20:24

1 Answer


If you must process this using XPath, then a two-step process can be used. The nodes are selected first, and the substring processing is then done relative to each selected node:

require(XML)
doc <- htmlParse("http://www.cpmy.com/contact.asp")
sapply(doc["//div[@id = 'leftcol']//p"],
       getNodeSet, "substring-before(./b/text(), 'Office')")

[1] "Philadelphia " "Westmont "     "Boston "       "New York " 
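If XPath is not a hard requirement, the second step can also be done in R itself. The following is only a sketch of that alternative, not the approach above: it swaps the per-node substring-before() call for a vectorised sub(). The inline HTML is a hypothetical stand-in for the contact page and assumes each office name sits in a <b> element at the start of its <p>.

```r
library(XML)

# Hypothetical stand-in for the contact page (assumed structure: the office
# name sits in a <b> element at the start of each <p>).
html <- '<div id="leftcol">
  <p><b>Philadelphia Office</b><br/>1880 JFK Boulevard</p>
  <p><b>New York Office</b><br/>5 Penn Plaza</p>
</div>'
doc <- htmlParse(html, asText = TRUE)

# Step 1: XPath selects every <b> node; xmlValue returns one string per node.
offices <- xpathSApply(doc, "//div[@id = 'leftcol']//p/b", xmlValue)

# Step 2: sub() is vectorised, so it shortens all elements at once;
# trimws() drops the space left in front of "Office".
cities <- trimws(sub("Office.*$", "", offices))
print(cities)
```

Because the trimming happens on an ordinary character vector, the XPath 1.0 restriction on string functions never comes into play.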

The XPath 1.0 specification (http://www.w3.org/TR/xpath/#section-String-Functions) states:

A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
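The conversion rule quoted above can be reproduced on a small inline document. This is a minimal sketch; the markup is hypothetical and only assumed to resemble the real page.

```r
library(XML)

# Hypothetical two-office document (assumed markup, not the real page).
html <- '<div id="leftcol">
  <p><b>Philadelphia Office</b></p>
  <p><b>Boston Office</b></p>
</div>'
doc <- htmlParse(html, asText = TRUE)

# The path on its own matches both <b> nodes.
matched <- xpathSApply(doc, "//div[@id = 'leftcol']//p/b", xmlValue)
print(length(matched))

# Wrapped in a string function, the node-set is first converted to the
# string-value of its first node, so only a single string comes back.
first_only <- xpathSApply(doc, "substring-before(//div[@id = 'leftcol']//p/b, 'Office')")
print(first_only)
```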

So a string function applied to a node-set returns only one result, hence the need for a two-step process. In XPath 2.0 a function call can appear as the last step of a path, so something like

"//div[@id = 'leftcol']//p/b/substring-before(., 'Office')"

would probably return what you were looking for. Note that the XML package is built on libxml2, which implements only XPath 1.0.

jdharrison