1

I'm writing a regex to help process street addresses. However, I'm unsure whether regex is the right tool for this problem.

I have a street address that looks like this:

7829 Hollywood Ave

I would like to write a regex that says this (pseudocode):

match a NUMBER then ONE OR MORE WORDS then a STREET TYPE

In javascript, this regex would look something like this:

/^\d+\s+.*(\sAve|\sStreet|\sSt\.|..800 MORE ABBREVIATIONS!...)/ig

As you can see, because there are 800+ postal street "type" abbreviations, this regex would be very large. I would have to generate it with code, which is OK, but I'm unsure whether this is a good way to solve this kind of problem.
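
Generating the pattern from a list is not much code. A minimal sketch, assuming a hypothetical STREET_TYPES array that stands in for the full list of abbreviations (only three are shown); escapeRegExp and buildAddressRegex are illustrative names, not part of any library:

```javascript
// Hypothetical, abbreviated list; the real one would hold all 800+ entries.
const STREET_TYPES = ["Ave", "Street", "St."];

// Escape characters that are special in a regex (e.g. the "." in "St.").
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build: NUMBER, one or more words, then one of the street types at the end.
function buildAddressRegex(types) {
  const alternatives = [...types]
    .sort((a, b) => b.length - a.length) // try longer alternatives first
    .map(escapeRegExp)
    .join("|");
  return new RegExp("^(\\d+)\\s+(.+?)\\s+(" + alternatives + ")$", "i");
}

const addressRegex = buildAddressRegex(STREET_TYPES);
const match = addressRegex.exec("7829 Hollywood Ave");
// match[1] is the number, match[2] the street name, match[3] the street type
```

The g flag is left out on purpose: a global regex keeps state in lastIndex between exec calls, which is easy to trip over when testing one address at a time.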

I could see this problem getting to the point where I want to write a regex that attempts to match a street NAME with one in a database. Then I really don't see how a standard regex could work in that situation:

match a NUMBER then **A STREET NAME IN A DATABASE** then a STREET TYPE

Any input is appreciated!

Kevin Reid
Chris Dutrow

3 Answers

3

If all addresses were as simple as <number> <name> <type> life would be very simple - but they aren't, so it isn't.

Street addresses are too complex for a single regular expression, e.g. 5/45 East 51st St or 215-217 Long Island Way. You need to either: break it up and parse the parts, have the user input the address in specific fields, or just accept what they give you.
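
The "break it up and parse the parts" option might look roughly like the sketch below. It is only an illustration under assumptions: KNOWN_TYPES is a hypothetical, abbreviated set of street types, parseAddress is a made-up name, and the number check covers just the two forms mentioned above (unit/number and ranges).

```javascript
// Hypothetical, abbreviated set of street types (lower-cased, no trailing dot).
const KNOWN_TYPES = new Set(["ave", "st", "street", "way"]);

// Tokenise on whitespace, then inspect the tokens by hand.
function parseAddress(input) {
  const tokens = input.trim().split(/\s+/);
  if (tokens.length < 3) return null;

  // Accept plain numbers, unit/number ("5/45") and ranges ("215-217").
  const number = tokens[0];
  if (!/^\d+(?:[\/-]\d+)?$/.test(number)) return null;

  // The last token must be a known street type, ignoring case and a final dot.
  const type = tokens[tokens.length - 1];
  if (!KNOWN_TYPES.has(type.toLowerCase().replace(/\.$/, ""))) return null;

  // Everything in between is treated as the street name.
  return { number, name: tokens.slice(1, -1).join(" "), type };
}
```

Here the regex only validates a single token; deciding what each token means is ordinary code, which is easier to extend when new address shapes turn up.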

RobG
  • Unfortunately, I have no control over how the user has put the address in. – Chris Dutrow Feb 27 '12 at 00:28
  • So you're left with parsing the input or just accepting what they give you. A RegExp can help with tokenising, but the parsing will be manual. – RobG Feb 27 '12 at 00:35
  • 1
    What's more, if you're dealing with international addresses then things get even more complex. Some countries put the number at the end of the street line. Some places don't have a number at all. Some street names have a number in them. Not all street names have a type. (Some places don't name streets either, but that's a problem that you can ignore.) – Donal Fellows Feb 27 '12 at 00:44
1

You could capture the street type and then check afterwards if the captured content is in the street type list.

The regex would become:

/^\d+\s+.*\s+(.*)/

or, with a named group (JavaScript syntax, ES2018 and later):

/^\d+\s+.*\s+(?<streettype>.*)/
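
A rough sketch of the capture-then-check idea in JavaScript, assuming a hypothetical STREET_TYPE_SET with only three entries in place of the real list; hasKnownStreetType is an illustrative name:

```javascript
// Hypothetical, abbreviated list of street types, stored lower-cased.
const STREET_TYPE_SET = new Set(["ave", "street", "st."]);

// Capture the last whitespace-separated word as the candidate street type.
// (?<streettype>...) is the JavaScript named-group syntax (ES2018+).
const LAST_WORD = /^\d+\s+.*\s+(?<streettype>\S+)$/;

function hasKnownStreetType(address) {
  const match = LAST_WORD.exec(address.trim());
  if (match === null) return false;
  // The lookup happens outside the regex, so the list can grow freely.
  return STREET_TYPE_SET.has(match.groups.streettype.toLowerCase());
}
```

This keeps the regex short no matter how many street types there are, since the list lives in a set lookup instead of a huge alternation.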

Ioan Alexandru Cucu
1

Use capture groups. I am not sure about JS, but in Java you do:

/^(\d+)\s+(.*)\s+(\w+)/ig

And you can get the content of each group in parentheses with Matcher.group(int).

Later, you match those strings against your database.
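
In JavaScript the same thing might be sketched as follows, with RegExp.prototype.exec returning the groups. KNOWN_STREETS is a hypothetical in-memory stand-in for the database, and lookUpStreet is an illustrative name:

```javascript
// Hypothetical stand-in for a database of canonical street names,
// keyed by a normalised (lower-case, single-spaced) form.
const KNOWN_STREETS = new Map([
  ["hollywood", "Hollywood"],
  ["long island", "Long Island"],
]);

// Groups: 1 = number, 2 = street name, 3 = street type (last word).
const ADDRESS_PARTS = /^(\d+)\s+(.+)\s+(\S+)$/;

function lookUpStreet(address) {
  const parts = ADDRESS_PARTS.exec(address.trim());
  if (parts === null) return null;
  const [, number, rawName, type] = parts;

  // Normalise case and spacing before the lookup.
  const key = rawName.trim().toLowerCase().replace(/\s+/g, " ");
  const name = KNOWN_STREETS.get(key);
  return name === undefined ? null : { number, name, type };
}
```

Only exact matches after normalisation are found here; correcting misspelled names would need some form of approximate matching on top, which this sketch does not attempt.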

Anyway... why? Maybe the street types justify it, but restricting the street names only adds more work for you and inconveniences the user (if the street name is not written exactly as it is in your database, or if your database is out of date). Do you want the user to give you his address? If he does not want to, he can supply fake data. Does the user want you to have his address? Then you can trust him to write it correctly...

SJuan76
  • My use case is very different from the one you are assuming. This isn't related to restricting a user's input. The end goal is to take address data that has errors and correct the errors. – Chris Dutrow Feb 27 '12 at 00:25
  • JavaScript capturing groups are much like Java's. However, on the issue of whether or not to validate user input, I disagree. The best systems detect errors in user input and, rather than failing silently or simply computing a bad answer, inform the user that they've made the mistake, possibly suggesting corrections. Google and many other popular sites offer query completion, and Office and web browsers offer spellcheck. If you want to design good software, you have to conform to your users' expectations. And users expect that software will point out mistakes and offer to fix them. – Adam Mihalcin Feb 27 '12 at 00:26
  • 1
    @AdamMihalcin I am pretty sure that the share of users who get their own address so wrong that you can't send them mail is low enough that it will seldom justify the effort of maintaining this code. It also introduces more trouble when the address is not listed, when you want to use the system in other countries, etc. And I do think that users are stupid, but only to some extent... – SJuan76 Feb 27 '12 at 00:33