I'm using R for geocoding. Some of my street addresses include unit numbers and I need to remove those before geocoding, but I'm not very good at regex commands. How could I transform addresses like these:

10 Fake St, Unit #5, New York, NY 10001
10 Fake St, Units #5,6,7, New York, NY 10001

into this:

10 Fake St, New York, NY 10001

Thanks!

user3786999

2 Answers

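Just replace `\s+Units? \S+` with an empty string (`""`). In an R string literal the backslashes must be doubled, for example `gsub("\\s+Units? \\S+", "", x)`, where `x` stands for your vector of addresses.

`\s+` matches one or more whitespace characters.

`Units?` matches "Unit" or "Units"; the literal space after it matches the space before the unit number.

`\S+` matches one or more non-whitespace characters, so it runs up to the next space and takes the trailing comma with it.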
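
As a quick check of the pattern above, here is a minimal JavaScript sketch run against the two sample addresses from the question. The variable names and the `g` flag are additions for illustration and are not part of the original answer.

```javascript
// Sketch: strip a "Unit"/"Units" segment using the pattern from this answer.
// The sample addresses come from the question; the g flag is an assumption
// added so that every occurrence in a string would be removed.
const addresses = [
  "10 Fake St, Unit #5, New York, NY 10001",
  "10 Fake St, Units #5,6,7, New York, NY 10001",
];

// \s+    : whitespace before "Unit"
// Units? : "Unit" or "Units", then a literal space
// \S+    : the unit list up to the next space, including the trailing comma
const unitPattern = /\s+Units? \S+/g;

const cleaned = addresses.map((address) => address.replace(unitPattern, ""));

console.log(cleaned);
```

Both entries should come out as `10 Fake St, New York, NY 10001`, because the comma after the street name is kept and the comma after the unit list is consumed by `\S+`.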

emsimpson92
  • Thanks! But what if some of the addresses don't have the word "Units"? Basically, how do I get rid of everything between the first comma and the last comma? – user3786999 Jul 16 '18 at 21:43
  • `.*?,(.*),.*` will do that, but I don't recommend using such an open ended pattern. You could end up with more than you're looking for. [You can see it here](https://regex101.com/r/wS80KW/7) – emsimpson92 Jul 16 '18 at 21:47
  • You're right, that doesn't work well. What about everything between the first comma and the substring "New York,"? – user3786999 Jul 16 '18 at 23:26
  • [Just add in the New York part: `.*?,(.*) New York,.*`](https://regex101.com/r/wS80KW/10) Then again, this only works if the address contains "New York". I suggest you find some sort of reproducible pattern in the data that you don't want. – emsimpson92 Jul 16 '18 at 23:28
  • If it always contains "#" when referring to units, you could do `([\s\w]+#\S+)` [Demo](https://regex101.com/r/wS80KW/15) – emsimpson92 Jul 16 '18 at 23:37
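
For the case raised in the comments, where some addresses lack the word "Unit", one possible alternative to an open-ended regex is to split on commas and keep only the outer fields. The JavaScript sketch below is not from the thread: `stripMiddle` is a hypothetical helper, and it assumes every address ends with exactly two comma-separated fields (city, then state and ZIP).

```javascript
// Sketch: keep the street (text before the first comma) and the final
// "City, State ZIP" part, dropping whatever sits in between.
// ASSUMPTION: every address ends with exactly two comma-separated fields
// (city, then state and ZIP). stripMiddle is a hypothetical helper name.
function stripMiddle(address) {
  const parts = address.split(",").map((part) => part.trim());
  if (parts.length < 3) {
    return address; // too few fields to tell street from city; leave as-is
  }
  const street = parts[0];
  const cityAndState = parts.slice(-2); // last two fields
  return [street, ...cityAndState].join(", ");
}

console.log(stripMiddle("10 Fake St, Unit #5, New York, NY 10001"));
console.log(stripMiddle("10 Fake St, Units #5,6,7, New York, NY 10001"));
console.log(stripMiddle("10 Fake St, Apt 2B, New York, NY 10001"));
```

Because it counts fields from the end, the commas inside a unit list such as `#5,6,7` do not matter, but an address whose city or street itself contains a comma would break the assumption.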

We look for `Unit` with an optional trailing `s` (`Units?`), then a space and a `#` followed by one or more groups of digits and a comma (`#(\d+,)+`). The one extra change is an optional space after each comma (`#(\d+, ?)+`), so the space that follows the unit list is removed too and the result keeps its formatting.

/Units? #(\d+, ?)+/g

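```javascript
// Sample addresses from the question, one per line.
const str = `10 Fake St, Unit #5, New York, NY 10001
10 Fake St, Units #5,6,7, New York, NY 10001`;

// Remove every "Unit #..." / "Units #..." segment, including its trailing comma and space.
console.log(str.replace(/Units? #(\d+, ?)+/g, ""));
```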
Nick Abbott