After much manual effort, I have created XPath expressions that retrieve, from parsed website HTML, the cities in which nearly 100 companies have offices. Too late, it dawned on me that zip codes for those cities would be excellent and distinctive identifiers.
I realize that no single piece of additional XPath code will cover all situations, but is there some generic expression that might reasonably often retrieve 5 digits in a row (or 5 digits, a hyphen and 4 digits) in the same or a nearby node [assume zip codes come after the city]?
For example, to the expression
//div[@class='content']//h5
I might add something like "and \\d{5}"
[I am not adept with XPath and am using R regex syntax with backslashes for five and only five digits]. I could then quickly paste on the additional code and see whether it brings back zip codes, doing the rest by hand.
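To make the pattern concrete, here is a minimal sketch of the "five digits, optionally a hyphen and four digits" match, applied in R to strings after they have been extracted rather than inside the XPath itself. The `addresses` vector and the `extract_zip` helper are hypothetical; the second string is made up to show the ZIP+4 form.

```r
# Hypothetical strings standing in for what xpathSApply(..., xmlValue) might return
addresses <- c("New York, New York 10020",
               "Springfield 12345-6789",   # made-up ZIP+4 example
               "No postal code here")

# Five digits, optionally followed by a hyphen and four more digits
zip_pattern <- "\\b\\d{5}(-\\d{4})?\\b"

# Hypothetical helper: first zip-like match per string, NA where none is found
extract_zip <- function(x) {
  hit <- regexpr(zip_pattern, x, perl = TRUE)
  found <- hit != -1
  out <- rep(NA_character_, length(x))
  out[found] <- regmatches(x, hit)
  out
}

zips <- extract_zip(addresses)
```

The word boundaries keep a 4-digit street number or a longer run of digits from matching, but any other 5-digit number in the text would still be picked up, so the results would need checking by hand.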
RESPONDING TO COMMENTS:
Here is one of hundreds of HTML snippets:
<div class="container">
<script type="text/javascript">
<div class="header">
<table>
<tbody>
<tr>
<td class="bodywrap" valign="top">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<table class="body" width="100%" cellspacing="0" cellpadding="20" border="0">
<tbody>
<tr>
<td>
<table class="body" width="100%" cellspacing="3" cellpadding="0" border="0">
<tbody>
<tr valign="top">
<td width="50%">
<table width="100%" cellspacing="0" cellpadding="0">
<img border="0" src="/files/Office/ac28ef17-906a-4ed0-9850-0af853da6abe/Presentation/ceOfficeNameImage/t_NewYork.gif">
<br>
<br>
1251 Avenue of the Americas
<br>
New York, New York 10020
<br>
T:
<span class="skype_c2c_print_container notranslate">212.262.6700</span>
I have tried this XPath expression, which does not succeed. Even if it were to extract "New York", I would like to be able to extract "10020".
I am on Windows 8 with XML 1.0 (there is no XML 2.0 according to a 2013 SO post).
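As far as I understand, the R XML package evaluates XPath 1.0 through libxml2, and XPath 1.0 has no regular-expression functions. A possible workaround, sketched below, is to use `translate()` to keep only text nodes that contain a digit and then apply the regex in R. The `html` string is a trimmed copy of the snippet above, and the `//td/text()` path is an assumption that fits this one page only.

```r
library(XML)

# Trimmed copy of the snippet above, parsed from a string for illustration
html <- '<table><tr><td width="50%">
<img border="0" src="/files/Office/ac28ef17-906a-4ed0-9850-0af853da6abe/Presentation/ceOfficeNameImage/t_NewYork.gif">
<br><br>
1251 Avenue of the Americas
<br>
New York, New York 10020
<br>
T: <span class="skype_c2c_print_container notranslate">212.262.6700</span>
</td></tr></table>'

doc <- htmlParse(html, asText = TRUE)

# XPath 1.0 only: a text node contains a digit if deleting all digits changes it
txt <- xpathSApply(doc, "//td/text()[translate(., '0123456789', '') != .]", xmlValue)
txt <- trimws(txt)

# Keep a zip-like run of digits only when it ends the line (zip follows the city)
zip <- regmatches(txt, regexpr("\\d{5}(-\\d{4})?$", txt, perl = TRUE))
```

Anchoring the pattern at the end of the line is a design choice based on the assumption that the zip code comes after the city; pages that put other text after the zip code would need a looser pattern.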
RESPONDING TO COMMENT: Here is an example of code I use.
library(XML)
# Parse the page into an internal document so XPath queries can be run on it
doc <- htmlTreeParse("http://www.butlersnow.com/Contact_Us.aspx", useInternalNodes = TRUE)
xpathSApply(doc, "//div[@class='content']//h5", xmlValue, trim = TRUE)