
After much manual effort, I have created XPath expressions that retrieve from parsed website HTML the cities in which nearly 100 companies have offices. Too late, it dawned on me that the zip codes for those cities would be an excellent and distinctive addition.

I realize that no single piece of additional XPath code will cover all situations, but is there some generic expression that might reasonably often retrieve 5 digits in a row (or 5 digits, a hyphen and 4 digits) in the same or nearby namespace [assume zip codes come after city]?

For example, the code

//div[@class='content']//h5

might add something like "and \\d{5}" [I am not adept with XPath and am using R regex syntax with backslashes for five and only five digits]. I could then quickly paste on the additional code and see whether it brings back zip codes, doing the rest by hand.

RESPONDING TO COMMENTS:

Here is one of hundreds of HTML codes:

<div class="container">
<script type="text/javascript">
<div class="header">
<table>
<tbody>
<tr>
<td class="bodywrap" valign="top">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<table class="body" width="100%" cellspacing="0" cellpadding="20" border="0">
<tbody>
<tr>
<td>
<table class="body" width="100%" cellspacing="3" cellpadding="0" border="0">
<tbody>
<tr valign="top">
<td width="50%">
<table width="100%" cellspacing="0" cellpadding="0">
<img border="0" src="/files/Office/ac28ef17-906a-4ed0-9850-0af853da6abe/Presentation/ceOfficeNameImage/t_NewYork.gif">
<br>
<br>
1251 Avenue of the Americas
<br>
New York, New York 10020
<br>
T:
<span class="skype_c2c_print_container notranslate">212.262.6700</span>

I have tried this XPath expression, which does not succeed. Even if it were to extract "New York", I would like to be able to extract "10020".

I am on Windows 8 with XML 1.0 (there is no XML 2.0 according to a 2013 SO post).

RESPONDING TO COMMENT: Here is an example of code I use.

library(XML)
# Parse the page, then return the trimmed text of every h5 inside div.content
doc <- htmlTreeParse("http://www.butlersnow.com/Contact_Us.aspx", useInternalNodes = TRUE)
xpathSApply(doc, "//div[@class='content']//h5", xmlValue, trim = TRUE)
lawyeR
  • XPath 1.0 or 2.0? What do your existing XPath expressions look like? What does the HTML look like? – LarsH Aug 14 '14 at 01:48
  • sessionInfo() says I am running other attached packages: [1] XML_3.98-1.1. I did show one XPath expression in the question; there are about 65 variations, all over the map. Likewise, the HTML varies from company to company. The question asks how to phrase an XPath expression to ADD ON to existing ones that have located the address nodes. – lawyeR Aug 14 '14 at 09:59
  • What is a "nearby namespace"? In what environment do you use XPath? We need to see the code XPath is embedded into - and crucially, HTML examples. How do you expect anyone to _add on_ to an unknown something? – Mathias Müller Aug 14 '14 at 10:20
  • I added an example. My terminology is probably wrong on "nearby namespace" but I thought that meant the node that returns the city -- so the zip code of the address is very likely "nearby". – lawyeR Aug 14 '14 at 10:29
  • We still need to know what version of XPath (not XML) you are running, and what environment you're running it in. For example, are you running XPath from within Javascript? That may tell us what version of XPath you're running. Can you show us the code where you invoke an XPath expression? – LarsH Aug 14 '14 at 10:40
  • I apologize for being so thick. My XPath comes from the R program version of XML, I believe. I edited the question to put in a sample of my code. – lawyeR Aug 14 '14 at 11:43
  • I am running Windows 8 using RStudio and the latest version of R. – lawyeR Aug 14 '14 at 11:45

1 Answer


So, we've already found out that you

  • Are using packages in R that offer HTML parsing and XPath functionality
  • Are looking for an XPath expression that singles out a particular sequence of characters from HTML text content

What we have not yet established is which version of XPath is supported by this module in the R language. Either it only conforms to the XPath 1.0 standard (more likely) or it also supports XPath 2.0.

Why is this relevant? Because only XPath 2.0 offers functions that can handle regular expressions. Regexes exist to solve the very problem you describe, that is, finding 5 digits in a row in arbitrary strings. Now, how do you find out which version is supported? Simply use a function that is only available in XPath 2.0, for example tokenize(), inside an XPath expression and see whether this raises an error.

Option 1: This R functionality turns out to support XPath 2.0

First, identify the elements that are likely to contain the ZIP code. As an example, let's say it is inside an h5 element. Then, use the matches() function together with a regular expression.

//h5[matches(.,'\d{5}')]

Or a slight variation of this. Of course, neither XPath nor R can discriminate between actual ZIP codes and other things that simply happen to consist of five digits in a row.

Option 2: Only XPath 1.0 is at your disposal

Then, in my opinion, there is no reasonable way to do this in a single XPath expression, because regexes are not available. But R itself happens to be good at regular expressions: extract all relevant strings from the HTML with XPath, and then search them with regexes in R, outside XPath.
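A minimal sketch of that second step, using base R only. It assumes the strings have already been pulled out with XPath; the `addresses` vector below is a stand-in built from the sample HTML in the question, and `zip_pattern` is an illustrative name, not part of any package.

```r
# Strings as they might come back from xpathSApply(..., xmlValue).
# These sample values are copied from the HTML shown in the question.
addresses <- c(
  "1251 Avenue of the Americas",
  "New York, New York 10020",
  "T: 212.262.6700"
)

# Five digits, optionally followed by a hyphen and four digits.
# The lookarounds (which need perl = TRUE) reject longer digit runs,
# so a six-digit number does not yield a five-digit "ZIP".
zip_pattern <- "(?<![0-9])[0-9]{5}(-[0-9]{4})?(?![0-9])"

# gregexpr() finds every match per string; regmatches() extracts them.
zips <- unlist(regmatches(addresses, gregexpr(zip_pattern, addresses, perl = TRUE)))
print(zips)
```

On these three strings only "10020" is returned, because neither the street number nor the dotted phone number contains five digits in a row.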


Note: All this does not in any way "prove" that a regular expression as simple as this is precise and restrictive enough to find ZIP codes only. In a large collection of HTML documents, there might be a lot of "false positives" that cannot be distinguished from "real" hits. Then, you'd have to refine the method, e.g. check the results against a database of ZIP codes.
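The database check mentioned in the note could look roughly like this. The `known_zips` vector is a hypothetical placeholder for a real ZIP code list loaded from a file or database, and the candidate values are made up for illustration.

```r
# Hypothetical reference set; a real one would be loaded from a ZIP code database.
known_zips <- c("10020", "39157", "60601")

# Candidate strings as a regex step might return them (illustrative values).
candidates <- c("10020", "00000", "10020-1234", "99999")

# Compare only the five-digit prefix so that ZIP+4 forms are checked too.
is_known <- substr(candidates, 1, 5) %in% known_zips
confirmed <- candidates[is_known]
print(confirmed)
```

Anything that fails the lookup is a candidate false positive and can be reviewed by hand.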

Since I am writing anyway, there is no such thing as a "nearby namespace". You are mistaking what in XPath is called the context item for a namespace. In the following example, http://www.ns.com is a namespace.

<ns:root xmlns:ns="http://www.ns.com">
  <ns:a/>
</ns:root>
Mathias Müller
  • +1, excellent answer. I agree that the 'XML' package of R most likely only supports XPath 1.0. The documentation I found (http://cran.r-project.org/web/packages/XML/XML.pdf) doesn't seem to say explicitly, but it does reference the spec for XPath 1.0 (http://www.w3.org/TR/xpath/). – LarsH Aug 14 '14 at 14:51
  • That was excellent, @Mathias Muller. Thank you. Since I tried tokenize() and got Error: could not find function "tokenize", I must be on XPath 1.0. I looked all over but could not find how to install XPath 2.0. I had previously installed the XML package. I went to WEC3 (if that is the name) but there was no XPath 2.0 download choice that I could find. Is it on github? Sorry to be so naive, and I respect your time and attention. – lawyeR Aug 14 '14 at 15:14
  • @lawyeR You are welcome. It is very likely that there is no implementation of XPath 2.0 in R at all, simply because nobody implemented it - then you have to live with version 1.0 and stick with Option 2 in my answer. (P.S. The name is "W3C" :-) – Mathias Müller Aug 14 '14 at 15:35
  • @lawyeR: Just FYI, W3C develops the standards (specifications, aka "recommendations") for how XPath (and other technologies) should behave, but does not develop or publish implementations. – LarsH Aug 14 '14 at 17:01