1

I have been playing around with regular expressions, but haven't had any luck yet. I need to introduce some address validation. I need to make sure that a user-defined address matches this format:

"717 N 2ND ST, MANKATO, MN 56001"

or possibly this one too:

"717 N 2ND ST, MANKATO, MN, 56001"

Everything else should be rejected, with an alert telling the user that the format is improper. I have been looking at the documentation and have tried and failed with many regular expression patterns. I have tried this (and many variations) without any luck:

pat = r'\d{1,6}(\w+),\s(w+),\s[A-Za-z]{2}\s{1,6}'

This one works, but it allows too much junk because it is only making sure it starts with a house number and ends with a zip code (I think):

pat = r'\d{1,6}( \w+){1,6}'

The comma placement is crucial because I split the input string on commas: the first item is the street address, the second is the city, and the state and ZIP are then split on a space (here I would like to use a second regex in case there is a comma between state and ZIP).

Essentially I would like to do this:

import re

# check for this format "717 N 2ND ST, MANKATO, MN 56001"
pat_1 = 'regex to match above pattern'
# check for this format "717 N 2ND ST, MANKATO, MN, 56001"
pat_2 = 'regex to match above format'

if re.match(pat_1, addr, re.IGNORECASE):
    pass  # extract address
elif re.match(pat_2, addr, re.IGNORECASE):
    pass  # extract address
else:
    raise ValueError('"{}" must match this format: "717 N 2ND ST, MANKATO, MN 56001"'.format(addr))

# do stuff with address

If anyone could help me with forming a regex to make sure there is a pattern match, I would greatly appreciate it!

crmackey

4 Answers

4

Here's one that might help. Whenever possible, I prefer to use verbose regular expressions with embedded comments, for maintainability.

Also note the use of (?P<name>pattern). This helps to document the intent of the match, and also provides a useful mechanism to extract the data, if your needs go beyond simple regex validation.

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
regex = r'''
    (?P<HouseNumber>\d+)\s+        # Matches '717 '
    (?P<Direction>[news])\s+       # Matches 'N '
    (?P<StreetName>\w+)\s+         # Matches '2ND '
    (?P<StreetDesignator>\w+),\s+  # Matches 'ST, '
    (?P<TownName>.*),\s+           # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+       # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                 # Matches '56001'
'''

# re.VERBOSE permits the whitespace and comments above; re.IGNORECASE ignores case
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
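To turn the pattern above into the validation step the question asks for, one option is a small wrapper that raises `ValueError` on a mismatch. This is a minimal sketch, not part of the original answer: the name `parse_address` is hypothetical, the flags are passed to `re.compile`, and `fullmatch` (Python 3.4+) is an assumption on my part so that trailing junk after the ZIP code is rejected, which plain `match` would let through.

```python
import re

ADDRESS_RE = re.compile(r'''
    (?P<HouseNumber>\d+)\s+        # '717 '
    (?P<Direction>[news])\s+       # 'N '
    (?P<StreetName>\w+)\s+         # '2ND '
    (?P<StreetDesignator>\w+),\s+  # 'ST, '
    (?P<TownName>.*),\s+           # 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+       # 'MN ' or 'MN, '
    (?P<ZIP>\d{5})                 # '56001'
''', re.VERBOSE | re.IGNORECASE)


def parse_address(addr):
    """Return the named parts of addr, or raise ValueError (hypothetical helper)."""
    # fullmatch requires the whole string to fit, not just a prefix.
    match = ADDRESS_RE.fullmatch(addr.strip())
    if match is None:
        raise ValueError(
            '"{}" must match this format: '
            '"717 N 2ND ST, MANKATO, MN 56001"'.format(addr))
    return match.groupdict()


parts = parse_address('717 N 2ND ST, MANKATO, MN, 56001')
print(parts['TownName'], parts['State'], parts['ZIP'])
```

The returned dictionary can then be used directly, so the separate split-on-comma step from the question may no longer be needed.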

To make certain fields optional, we replace + with *, since + means ONE-or-more, and * means ZERO-or-more. Here is a version that matches the new requirements in the comments:

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
# Goal:  '717 N 2ND ST NE, MANKATO, MN, 56001',
# Goal:  '717 N 2ND, MANKATO, MN, 56001',
regex = r'''
    (?P<HouseNumber>\d+)\s+         # Matches '717 '
    (?P<Direction>[news])\s+        # Matches 'N '
    (?P<StreetName>\w+)\s*          # Matches '2ND ', with optional trailing space
    (?P<StreetDesignator>\w*)\s*    # Optionally Matches 'ST '
    (?P<StreetDirection>[news]*)\s* # Optionally Matches 'NE'
    ,\s+                            # Force a comma after the street
    (?P<TownName>.*),\s+            # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+        # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                  # Matches '56001'
'''

# re.VERBOSE permits the whitespace and comments above; re.IGNORECASE ignores case
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 N 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
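The difference between `+` and `*` can also be checked in isolation. The sketch below is illustrative only: the two short patterns are made up for the demonstration and cover just the street name and designator, not a full address.

```python
import re

# With '+', the designator needs at least one word character.
required = re.compile(r'(?P<StreetName>\w+)\s+(?P<StreetDesignator>\w+),')
# With '*', the designator (and the space before it) may be absent.
optional = re.compile(r'(?P<StreetName>\w+)\s*(?P<StreetDesignator>\w*),')

for street in ('2ND ST,', '2ND,'):
    print(street, bool(required.match(street)), bool(optional.match(street)))

# An optional group written with '*' still takes part in the match,
# so it comes back as an empty string rather than None.
designator = optional.match('2ND,').group('StreetDesignator')
print(repr(designator))
```

Code that consumes the match should therefore test for an empty string when a field is optional in this way.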

Next, consider the OR operator, |, and the non-capturing group operator, (?:pattern). Together, they can describe complex alternatives in the input format. This version matches the new requirement that some addresses have the direction before the street name, and some have the direction after the street name, but no address has the direction in both places.

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
# Goal:  '717 2ND ST NE, MANKATO, MN, 56001',
# Goal:  '717 N 2ND, MANKATO, MN, 56001',
regex = r'''
    (?: # Matches any sort of street address
        (?: # Matches '717 N 2ND ST' or '717 N 2ND'
            (?P<HouseNumber>\d+)\s+      # Matches '717 '
            (?P<Direction>[news])\s+     # Matches 'N '
            (?P<StreetName>\w+)\s*       # Matches '2ND ', with optional trailing space
            (?P<StreetDesignator>\w*)\s* # Optionally Matches 'ST '
        )
        | # OR
        (?:  # Matches '717 2ND ST NE' or '717 2ND NE'
            (?P<HouseNumber2>\d+)\s+      # Matches '717 '
            (?P<StreetName2>\w+)\s+       # Matches '2ND '
            (?P<StreetDesignator2>\w*)\s* # Optionally Matches 'ST '
            (?P<Direction2>[news]+)       # Matches 'NE'
        )
    )
    ,\s+                             # Force a comma after the street
    (?P<TownName>.*),\s+             # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+         # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                   # Matches '56001'
'''

# re.VERBOSE permits the whitespace and comments above; re.IGNORECASE ignores case
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        d = match.groupdict()
        # Only one of the two alternatives takes part in the match
        print("    House is on {0} side of {1}".format(
            d['Direction'] or d['Direction2'],
            d['TownName']))
    else:
        print("    invalid entry")
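Because the two alternatives use separate group names (`Direction` and `Direction2`, and so on), code that consumes the match has to check both. One way to hide that is a helper that folds each pair into a single key. This is only a sketch: `merge_groups` is a hypothetical name, the short pattern is a cut-down stand-in for the full address regex, and it assumes a trailing `2` always marks the second alternative.

```python
import re

pattern = re.compile(r'''
    (?:
        (?P<Direction>[news])\s+(?P<StreetName>\w+)      # 'N 2ND'
      |
        (?P<StreetName2>\w+)\s+(?P<Direction2>[news]+)   # '2ND NE'
    )
''', re.VERBOSE | re.IGNORECASE)


def merge_groups(groups):
    """Fold 'Name2' entries into 'Name' (hypothetical helper)."""
    merged = {}
    for key, value in groups.items():
        base = key[:-1] if key.endswith('2') else key
        # Groups from the alternative that did not match are None;
        # keep whichever alternative actually took part in the match.
        if merged.get(base) is None:
            merged[base] = value
    return merged


for street in ('N 2ND', '2ND NE'):
    print(street, merge_groups(pattern.fullmatch(street).groupdict()))
```

After merging, the caller can read `Direction` and `StreetName` without caring which alternative matched.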
Robᵩ
  • 1
    Wow! Had to mark this as the best answer. I did not know you could do this with a regex. I like that I can get everything back as a dictionary with `groupdict()`. Would this also catch a situation where there may be a direction after a street name? Say, if there were an address like "515 N 2ND ST SE, MANKATO, MN 56001"? – crmackey Jun 19 '15 at 14:49
  • Also, there may not always be a direction. Do you have any advice for that? – crmackey Jun 19 '15 at 14:50
  • 1
    One reason why I made my answer so verbose is so that you could modify it yourself. If a string is optional, replace the `+` with a `*`. If you want to add a part, just add another `(?P<name>pattern)` bit. – Robᵩ Jun 19 '15 at 14:51
  • 1
    This is awesome! Thank you! This actually makes some sense to me. I appreciate you taking the time to add a detailed response. – crmackey Jun 19 '15 at 15:09
  • Sorry to bug you one more time, but how could I tell it to use Direction or StreetDirection (the suffix)? I think there should always be ONLY one direction. I don't think `'717 N 2ND ST NE, MANKATO, MN, 56001'` makes sense, because there is already a direction; the address should be either `717 N 2nd ST` OR `717 2ND ST NE`. I made the Direction optional, but I do not get a match with `717 2ND ST NE, MANKTAO, MN 56001`. Again, I really appreciate your awesome answer. Meanwhile, I'll try to figure it out on my own. – crmackey Jun 19 '15 at 15:27
  • 1
    If you have two very different mutually exclusive patterns (like '717 N 2ND ST' and '717 2ND ST N'), consider using the `|` operator. I'll post a version in a few minutes. – Robᵩ Jun 19 '15 at 15:32
  • Thank you! This is exactly what I needed. This is a good working example that I can refer back to and do more research on. I appreciate it! – crmackey Jun 19 '15 at 17:50
1

How about this:

((\w|\s)+),((\w|\s)+),\s*(\w{2})\s*,?\s*(\d{5}).*

You can also use it to extract the street, city, state and ZIP from groups \1, \3, \5 and \6 respectively. Groups \2 and \4 capture only the last character of the street and city, but this doesn't affect validity.
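In Python the extraction could look like the sketch below. It is an illustration only: the variable names are my own, and the `.strip()` is added because group 3 also captures the space that follows the first comma.

```python
import re

pattern = re.compile(r'((\w|\s)+),((\w|\s)+),\s*(\w{2})\s*,?\s*(\d{5}).*')

match = pattern.match('717 N 2ND ST, MANKATO, MN, 56001')
if match:
    # Groups 2 and 4 hold only the last character of the street and city,
    # so only groups 1, 3, 5 and 6 are read here.
    street, city, state, zip_code = (
        match.group(n).strip() for n in (1, 3, 5, 6))
    print(street, city, state, zip_code)
```

Note that the trailing `.*` means anything after the five-digit ZIP is accepted, and that no house number is required at the start.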

1
\d{1,6}\s\w+\s\w+\s[A-Za-z]{2},\s([A-Za-z]+),\s[A-Za-z]{2}(,\s\d{1,6}|\s\d{1,6})

You can test the regex at this link: https://regex101.com/r/yN7hU9/1
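The same check can be run locally. The sketch below is not from the answer; it uses `fullmatch` (my assumption, so that the whole string must fit) and sample addresses taken from elsewhere in this thread. As written, the pattern expects exactly three words after the house number, the last of them two letters long, and a single-word city, so an address such as '1234 N D AVE, East Boston, MA, 02134' is rejected.

```python
import re

pattern = re.compile(
    r'\d{1,6}\s\w+\s\w+\s[A-Za-z]{2},\s([A-Za-z]+),'
    r'\s[A-Za-z]{2}(,\s\d{1,6}|\s\d{1,6})')

samples = (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '1234 N D AVE, East Boston, MA, 02134',
)
# Map each sample to whether the entire string fits the pattern.
results = {addr: bool(pattern.fullmatch(addr)) for addr in samples}
for addr, ok in results.items():
    print(addr, '->', 'valid' if ok else 'invalid')
```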

pchmn
1

You could use this:

\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}

It matches a string that starts with a house number, followed by one or more words and then a comma. It then looks for a city name of at least one word, followed by a comma. Next it expects exactly two capital letters followed by an optional comma, and finally a ZIP code of one to six digits.
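A minimal sketch of using this pattern for validation follows. The helper name `is_valid_address` is hypothetical, and `fullmatch` is my own choice so that text after the ZIP code is not silently accepted. Because the state is written as `[A-Z]{2}`, a lowercase state fails unless `re.IGNORECASE` is passed, as the question's code does.

```python
import re

PATTERN = re.compile(r'\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}')


def is_valid_address(addr):
    """Hypothetical helper: True if the whole string fits the pattern."""
    return PATTERN.fullmatch(addr) is not None


for addr in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '1234 N D AVE, East Boston, MA, 02134',
    '717 N 2ND, Makata, 56001',          # no state, so rejected
    '717 N 2ND ST, MANKATO, mn 56001',   # lowercase state, so rejected
):
    print(addr, '->', 'valid' if is_valid_address(addr) else 'invalid')
```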

Buzz