3

I have spreadsheets with poorly-formed addresses in them, and I need them to be reasonably good to use for geocoding. I've tried some of the Python libraries for parsing addresses, but they aren't able to figure out some of them. As an example,

"N MONON AVE FRANCESVILLE W YELLOW ST"

The state for all of them is Indiana, which I have no problem concatenating into the submitted string. In the above example, it is an intersection, which the geocoder does accept as:

"N MONON AVE & W YELLOW ST FRANCESVILLE"

My thinking is that the easiest way is to find the first word after a street type (Ave, Dr, Ct, etc.), move it to the end, and add an ampersand in its place.

I have this code, which is probably horribly inefficient, but it does capture only the first street type; in the above example, it will output AVE.

/(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)[^(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)]/i

What I'm not sure how to do is tell it to grab whatever word comes immediately after the first instance of a street type. From there, I should be able to use re.search and .group(n) to extract the city and insert it into the parsed string.

Stephan Garland

2 Answers

1

You may use

rx = re.compile(r"(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\s+(\S+)\s*(.*)", re.I)

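The addition is \s+(\S+)\s*(.*): one or more whitespace characters (\s+), one or more non-whitespace characters ((\S+), Group 2, which captures the city), zero or more whitespace characters (\s*), and zero or more characters other than line breaks ((.*), Group 3, which captures the rest of the address). Group 1 is the street type itself.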
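
To see what each group captures before doing any substitution, a minimal sketch along these lines may help. It reuses the pattern above on the sample address from the question; the variable names street_type, city and rest are only illustrative.

```python
import re

rx = re.compile(
    r"(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\s+(\S+)\s*(.*)",
    re.I,
)
s = "N MONON AVE FRANCESVILLE W YELLOW ST"

# search() finds the leftmost match, i.e. the first street type followed by a word
m = rx.search(s)
if m:
    street_type, city, rest = m.groups()
    print(street_type)  # AVE
    print(city)         # FRANCESVILLE
    print(rest)         # W YELLOW ST
```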

Python demo:

import re

rx = re.compile(r"(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\s+(\S+)\s*(.*)", re.I)
s = "N MONON AVE FRANCESVILLE W YELLOW ST"
# \1 = street type, \2 = city, \3 = rest of the address
result = rx.sub(r"\1 & \3 \2", s)
print(result)  # N MONON AVE & W YELLOW ST FRANCESVILLE
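
One caveat, stated as an assumption about the data rather than something checked against the real spreadsheets: without word boundaries, the alternation can match a street type at the end of a longer word, such as the ST in WEST. A sketch of a stricter variant follows; the \b anchors and the names STREET_TYPES, rx_strict and fix_intersection are hypothetical additions, not part of the original pattern.

```python
import re

STREET_TYPES = r"Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St"

# \b on both sides requires the street type to be a whole word,
# so "ST" at the end of "WEST" is skipped.
rx_strict = re.compile(rf"\b({STREET_TYPES})\b\s+(\S+)\s*(.*)", re.I)


def fix_intersection(address):
    """Move the word after the first street type to the end and insert '&'."""
    return rx_strict.sub(r"\1 & \3 \2", address)


print(fix_intersection("WEST MAIN ST FRANCESVILLE W YELLOW ST"))
# WEST MAIN ST & W YELLOW ST FRANCESVILLE
```

With the unanchored pattern, the same input would be split after WEST instead of after MAIN ST.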
Wiktor Stribiżew
  • Wonderful! I'm still trying to get groups down; re.sub is quite powerful. I also tried this out on non-conforming addresses to make sure it didn't mess them up, and it worked fine. – Stephan Garland Nov 17 '16 at 13:59
  • Glad it worked for you. Please also consider upvoting if my answer proved helpful to you (see [How to upvote on Stack Overflow?](http://meta.stackexchange.com/questions/173399/how-to-upvote-on-stack-overflow)). – Wiktor Stribiżew Nov 17 '16 at 14:01
  • Let me know what "get groups down" means. Do you mean you need to also return a list of the groups? That can easily be done with a callback inside `re.sub`. – Wiktor Stribiżew Nov 17 '16 at 14:01
  • No, I meant just gaining proficiency at splitting out a string into groups and using them. I've figured out the rest of what I needed to do. Again, many thanks. – Stephan Garland Nov 17 '16 at 19:17
1
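import re

s = "N MONON AVE FRANCESVILLE W YELLOW ST"
# The lazy (.*?) makes the first street type the one that is matched, not the last.
# Groups: 1 = text before the type, 2 = street type, 3 = city, 4 = rest of the address.
regex = r"(.*?) (Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St) ([A-Za-z]*) (.*)"
result = re.sub(regex, r"\1 \2 & \4 \3", s, flags=re.I)
print(result)  # N MONON AVE & W YELLOW ST FRANCESVILLE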
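
Because this pattern requires a space and further text after the city, an ordinary address that ends with the city does not match, and re.sub returns it unchanged. A small sketch of that behaviour, using a pattern of the same shape with the first group made lazy so that the first street type is the one matched; the second sample address is made up for illustration.

```python
import re

regex = re.compile(
    r"(.*?) (Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St) ([A-Za-z]*) (.*)",
    re.I,
)

addresses = [
    "N MONON AVE FRANCESVILLE W YELLOW ST",  # intersection with the city in the middle
    "123 N MAIN ST FRANCESVILLE",            # hypothetical ordinary address, nothing after the city
]

# sub() leaves a string untouched when the pattern does not match it
cleaned = [regex.sub(r"\1 \2 & \4 \3", a) for a in addresses]
for c in cleaned:
    print(c)
# N MONON AVE & W YELLOW ST FRANCESVILLE
# 123 N MAIN ST FRANCESVILLE
```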
eric.christensen