
I'm doing some string cleaning and I've run into an issue. I have ~2,000,000 rows of address data that I need to clean up. Here is a small sample that I've made up:

addresses <- c('123 Alphabet Road, Denver, CO', '% Andrew L. Doe P.O. BOX 123, New York, NY', '19 Serious Road, Providence, RI', '% Johnny Cupcakes 1947 Numbers Avenue, Boston, MA')

I'd like to keep the first and third elements as is. For the second and fourth elements, I'd like to remove everything before the "P.O." or the "1947", respectively.

Although simply removing all characters until I hit a number or a "P.O." could work, I'm afraid that some valid addresses may start with other alphabetical characters.

As I see it, my approach will have to follow these steps:

1) Search for the "%" at the beginning of the string: "^\\%"
2) Replace the substring that starts at the "%" and runs up to the first "P\\.O\\." or [:digit:]

I'm still trying to figure out look behinds/aheads. I suspect that I'll have to do a mix of both in order to obtain:

c('123 Alphabet Road, Denver, CO', 'P.O. BOX 123, New York, NY', '19 Serious Road, Providence, RI', '1947 Numbers Avenue, Boston, MA')

Any help is greatly appreciated!

Sincerely, Andy

Andy B

1 Answer


Perhaps

gsub("^\\%.*?, +", "", addresses)
#output:
[1] "123 Alphabet Road, Denver, CO"   "P.O. BOX 123, New York, NY"      "19 Serious Road, Providence, RI"
[4] "1947 Numbers Avenue, Boston, MA"

This removes everything from the % at the start of the string up to the first comma, plus any whitespace that follows that comma. It relies on a comma separating the unwanted prefix from the address.
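To see why the comma anchor is fragile, here is a small sketch on two made-up rows: one with a comma after the name and one without. The rows and variable names are assumptions for illustration only, and % is left unescaped because it is not a regex metacharacter.

```r
# Made-up rows: the first has a comma after the name, the second does not.
with_comma    <- "% Johnny Cupcakes, 1947 Numbers Avenue, Boston, MA"
without_comma <- "% Johnny Cupcakes 1947 Numbers Avenue, Boston, MA"

# Same idea as the comma-anchored pattern above; sub() is enough
# because the pattern is anchored with ^ and can match only once.
pattern <- "^%.*?, +"

ok     <- sub(pattern, "", with_comma)     # "1947 Numbers Avenue, Boston, MA"
broken <- sub(pattern, "", without_comma)  # "Boston, MA" (street is lost)

print(ok)
print(broken)
```

Without a comma after the name, the first comma in the string is the one after the street, so the street address is removed along with the prefix.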

EDIT: with the tougher example:

using lookahead:

gsub("^\\%.*?(?=(P\\.O\\.|\\d))", "", addresses, perl = TRUE)
#output
[1] "123 Alphabet Road, Denver, CO"   "P.O. BOX 123, New York, NY"      "19 Serious Road, Providence, RI"
[4] "1947 Numbers Avenue, Boston, MA"

^\\% - match % at the start of the string
.*? - lazy match of any characters (as few as needed for the rest of the pattern to match) - try it without the ? (only the 7 of 1947 is left, since quantifiers are greedy by default)
(?=...) - positive lookahead (zero-length assertion): what is inside the parentheses must follow, but it is not consumed
(P\\.O\\.|\\d) - P.O. or a digit
perl = TRUE - needed to be able to use lookahead/lookbehind
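The difference between the lazy and the greedy quantifier can be checked directly. This is a minimal sketch on the two prefixed rows from the question; the variable names are made up for illustration.

```r
prefixed <- c("% Andrew L. Doe P.O. BOX 123, New York, NY",
              "% Johnny Cupcakes 1947 Numbers Avenue, Boston, MA")

# Lazy: stop at the FIRST position followed by "P.O." or a digit.
lazy <- sub("^\\%.*?(?=(P\\.O\\.|\\d))", "", prefixed, perl = TRUE)

# Greedy: run to the end, then backtrack to the LAST such position.
greedy <- sub("^\\%.*(?=(P\\.O\\.|\\d))", "", prefixed, perl = TRUE)

print(lazy)
# "P.O. BOX 123, New York, NY"  "1947 Numbers Avenue, Boston, MA"
print(greedy)
# "3, New York, NY"             "7 Numbers Avenue, Boston, MA"
```

The greedy version keeps only what follows the last digit in each string, which is why the lazy `.*?` is the one to use here.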

using capture groups:

gsub("^\\%.*?(\\d|P\\.O\\.)", "\\1", addresses, perl = TRUE)
#output
[1] "123 Alphabet Road, Denver, CO"   "P.O. BOX 123, New York, NY"      "19 Serious Road, Providence, RI"
[4] "1947 Numbers Avenue, Boston, MA"

^\\%.*? - same as above
() - capture group - whatever it matched can be referenced later as \\1 in the replacement; up to 9 backreferences are available (\\1...\\9)
\\d|P\\.O\\. - a digit or P.O.
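As a quick check, the capture-group pattern can be wrapped in a small function and run on a mix of rows, including a made-up one with no comma between the name and the street number. The function name `strip_prefix` and the sample vector are assumptions for illustration, not part of the original data.

```r
# Hypothetical helper: drop a leading "% ..." prefix, keeping everything
# from the first digit or "P.O." onwards.
strip_prefix <- function(x) {
  # sub() suffices: the pattern is anchored with ^, so it matches at most once.
  sub("^\\%.*?(\\d|P\\.O\\.)", "\\1", x, perl = TRUE)
}

sample_rows <- c("123 Alphabet Road, Denver, CO",
                 "% Andrew L. Doe P.O. BOX 123, New York, NY",
                 "% c/o John Doe 987 Ninety Street, Los Angeles, CA")

cleaned <- strip_prefix(sample_rows)
print(cleaned)
# "123 Alphabet Road, Denver, CO"
# "P.O. BOX 123, New York, NY"
# "987 Ninety Street, Los Angeles, CA"
```

Rows that do not start with % are returned unchanged, because the pattern cannot match them.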

missuse
  • This could work! However, I think it will only be relevant for those observations where there's a comma after the irrelevant sub-string. What would happen if one of the rows contained "% c/o John Doe 987 Ninety Street, Los Angeles, CA"? – Andy B Oct 11 '17 at 18:05
  • Thanks! I've made the changes so that a comma anchor can't be used. – Andy B Oct 11 '17 at 18:13
  • The code works on my data set. Could you do a quick translation? My reading is that the regular expression detects % at the start of the string ("^\\%") and then replaces that plus all characters (".*?") which is defined as optional since there may be no characters between % and the look-ahead assertion ("?=..."). Is this correct? – Andy B Oct 11 '17 at 18:32