I'm doing some string cleaning and have run into an issue. I have ~2,000,000 rows of address data that I need to clean up. Here is a small sample that I've made up:
addresses <- c('123 Alphabet Road, Denver, CO', '% Andrew L. Doe P.O. BOX 123, New York, NY', '19 Serious Road, Providence, RI', '% Johnny Cupcakes 1947 Numbers Avenue, Boston, MA')
I'd like to keep the first and third elements as is. For the second and fourth elements, I'd like to remove everything before the "P.O." or the "1947", respectively.
Simply removing all characters until I hit a number or a "P.O." could work on this sample, but I'm afraid that some of my real addresses legitimately begin with other alphabetical characters and would be damaged by that.
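To illustrate the concern, here is a minimal sketch of that unanchored approach. The first address is hypothetical (made up for this example, not taken from my data), and the variable names are only for illustration:

```r
# Hypothetical address that legitimately starts with letters,
# plus one of the sample addresses that does need cleaning.
x <- c("One Main Plaza Suite 4, Denver, CO",
       "% Johnny Cupcakes 1947 Numbers Avenue, Boston, MA")

# Naive approach: drop everything before the first "P.O." or digit,
# regardless of whether the string starts with "%".
naive <- sub("^.*?(?=P\\.O\\.|[[:digit:]])", "", x, perl = TRUE)
naive
# [1] "4, Denver, CO"                   "1947 Numbers Avenue, Boston, MA"
```

The second element comes out right, but the first one loses part of a valid address, which is why I think the pattern has to be tied to the leading "%".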
As I see it, my approach will have to follow these steps:
1) Search for the "%" at the beginning of the string: "^\\%"
2) Replace the substring starting at that "%" all the way up to (but not including) the first "P\\.O\\." or digit ([[:digit:]])
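The two steps above could be sketched roughly as follows. This is only my attempt, assuming a single `sub()` call with `perl = TRUE` and a lookahead is an acceptable way to do it. Note that "%" has no special meaning in a regex, so it is left unescaped here, and the POSIX class is written inside a bracket expression as `[[:digit:]]`:

```r
addresses <- c('123 Alphabet Road, Denver, CO',
               '% Andrew L. Doe P.O. BOX 123, New York, NY',
               '19 Serious Road, Providence, RI',
               '% Johnny Cupcakes 1947 Numbers Avenue, Boston, MA')

# ^%            : only touch strings that start with "%"
# .*?           : lazily consume as little as possible ...
# (?=P\.O\.|\d) : ... up to, but not including, the first "P.O." or digit
pattern <- "^%.*?(?=P\\.O\\.|[[:digit:]])"

# The lookahead requires perl = TRUE; strings that do not match are returned unchanged.
cleaned <- sub(pattern, "", addresses, perl = TRUE)
cleaned
```

With this pattern, a string that starts with "%" but contains neither "P.O." nor a digit would not match at all and would be left as is.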
I'm still trying to figure out lookbehinds and lookaheads. I suspect that I'll have to use a mix of both in order to obtain:
c('123 Alphabet Road, Denver, CO', 'P.O. BOX 123, New York, NY', '19 Serious Road, Providence, RI', '1947 Numbers Avenue, Boston, MA')
Any help is greatly appreciated!
Sincerely, Andy