I have spreadsheets containing poorly formed addresses, and I need to clean them up enough to use for geocoding. I've tried some of the Python libraries for parsing addresses, but they can't handle some of them. As an example,
"N MONON AVE FRANCESVILLE W YELLOW ST"
The state for all of them is Indiana, which I have no problem concatenating into the submitted string. In the above example, it is an intersection, which the geocoder does accept as:
"N MONON AVE & W YELLOW ST FRANCESVILLE"
My thinking is that the easiest way is to find the first word after a street type (Ave, Dr, Ct, etc.), move it to the end, and add an ampersand in its place.
I have this code, which is probably horribly inefficient, but it does capture only the first street type; in the above example, it will output AVE.
/(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)[^(Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)]/i
What I'm not sure how to do is tell it to grab whatever word is immediately after the first instance of a street type. From there, I should be able to use re.search and .group(n) to extract the city and put it into the parsed string.
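For illustration, here is a minimal sketch of the rearrangement described above, using named groups with re.search. It assumes the city is a single word and that the first street-type token really ends the first street name; multi-word cities (e.g. FORT WAYNE) or streets such as ST JOHN ST would need extra handling. The names STREET_TYPES, INTERSECTION and reorder_intersection are hypothetical. Note also that in the pattern above, [^(...)] is a character class: it excludes the individual characters listed, not the words, so it is dropped here in favour of word boundaries.

```python
import re

# Street-type tokens copied from the pattern in the question; extend as needed.
# Longer forms come first so "Avenue" is tried before "Ave".
STREET_TYPES = r"(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)"

# first  : everything up to and including the first street-type word (lazy match)
# city   : the single word that follows it (assumption: one-word city names)
# second : whatever remains, taken to be the cross street
INTERSECTION = re.compile(
    rf"^(?P<first>.*?\b{STREET_TYPES})\b\s+(?P<city>\S+)\s+(?P<second>.+)$",
    re.IGNORECASE,
)

def reorder_intersection(raw):
    """Return 'STREET1 & STREET2 CITY', or the input unchanged if it does not fit."""
    match = INTERSECTION.search(raw.strip())
    if match is None:
        return raw
    return f"{match.group('first')} & {match.group('second')} {match.group('city')}"

print(reorder_intersection("N MONON AVE FRANCESVILLE W YELLOW ST"))
# N MONON AVE & W YELLOW ST FRANCESVILLE
```

Strings with nothing after the city (a plain street address rather than an intersection) do not match and are returned as-is, so the function can be applied to a whole column without altering those rows.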