
I have three strings, each containing a street name and an apartment number.

"32 Syndicate street", "Street 45 No 100" and "15, Tom and Jerry Street"

Here,

"32 Syndicate street" -> {"street name": "Syndicate street", "apartment number": "32"}
"Street 45 No 100" -> {"street name": "Street 45", "apartment number": "No 100"}
"15, Tom and Jerry Street" -> {"street name": "Tom and Jerry Street", "apartment number": "15"}

I am trying to use Python's regex to get the street names and apartment numbers separately. This is my current code, which is having problems:

import re 
for i in ["32 Syndicate street","Street 45 No 100","15, Tom and Jerry Street"]:
    ###--- write patterns for street names
    pattern_street = re.compile(r'([A-Za-z]+\s?\w+ | [A-Za-z]+\s?[A-Za-z]+\s?[A-Za-z]+\s? | [A-Za-z]+\s?)') 
    match_street = pattern_street.search(i) 
    
    ###--- write patterns for apartment numbers
    pattern_aptnum = re.compile(r'(^\d+\s? | [A-Za-z]+[\s?]+[0-9]+$)') 
    match_aptnum = pattern_aptnum.search(i)

    fin_street = match_street[0] ##--> final street name
    fin_aptnum = match_aptnum[0] ##--> final apartment name 

    print("street--",fin_street)
    print("apartmentnumber--",fin_aptnum)

I get the following output:

street--  Syndicate street 
apartmentnumber-- 32 
street-- Street 45 
apartmentnumber--  No 100

I have two problems:

  1. I am not able to get the apartment number "15" for the final string.
  2. Why is there a space at the beginning of the matched values in "street--  Syndicate street" and "apartmentnumber--  No 100"?
Srivatsan

2 Answers


You may get the apartment number using

^\d+|\bNo\s*\d+

See the regex demo. The ^\d+|\bNo\s*\d+ regex matches either one or more digits at the start of the string, or the whole word prefix No (preceded by a word boundary), zero or more whitespace characters, and then one or more digits.

To capture the street information, you can use

^\d+,?\s*(.*)|^(.*?)\s+No\s*\d+

See this regex demo. Details:

  • ^\d+,?\s*(.*) - start of string, one or more digits, an optional comma, 0+ whitespaces and then any zero or more chars other than line break chars as many as possible captured into Group 1
  • | - or
  • ^(.*?)\s+No\s*\d+ - start of string, any zero or more chars other than line break chars as few as possible captured into Group 2, 1+ whitespaces, No, 0+ whitespaces, and then 1+ digits.
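
To make the two capture groups concrete, here is a minimal sketch that prints which group is filled for each of the three sample strings. The helper name street_groups is hypothetical and only serves the illustration; the pattern itself is the one described above.

```python
import re

# Street pattern from the answer: Group 1 is filled by the first branch,
# Group 2 by the second branch; the unused group is None.
pattern_street = re.compile(r'^\d+,?\s*(.*)|^(.*?)\s+No\s*\d+')

def street_groups(text):
    """Return (Group 1, Group 2) for the street pattern, or None if nothing matches.

    Hypothetical helper, not part of the original answer.
    """
    m = pattern_street.search(text)
    return m.groups() if m else None

for s in ["32 Syndicate street", "Street 45 No 100", "15, Tom and Jerry Street"]:
    print(s, "->", street_groups(s))
```

Because only one branch can match, exactly one group is non-empty, which is why picking whichever group is set (for example with `or`) works.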

In Python, avoid compiling regexps inside a for loop; compile them once before the loop. See the Python demo:

import re 

pattern_aptnum = re.compile(r'^\d+|\bNo\s*\d+')
pattern_street = re.compile(r'^\d+,?\s*(.*)|^(.*?)\s+No\s*\d+') 
for i in ["32 Syndicate street","Street 45 No 100","15, Tom and Jerry Street"]:
    fin_street = ""
    fin_aptnum = ""
    print("String:", i)
    match_street = pattern_street.search(i)
    if match_street:
        fin_street = match_street.group(1) or match_street.group(2)
    match_aptnum = pattern_aptnum.search(i)
    if match_aptnum:
        fin_aptnum = match_aptnum.group()

    print("street--",fin_street)
    print("apartmentnumber--",fin_aptnum)

Output:

String: 32 Syndicate street
street-- Syndicate street
apartmentnumber-- 32
String: Street 45 No 100
street-- Street 45
apartmentnumber-- No 100
String: 15, Tom and Jerry Street
street-- Tom and Jerry Street
apartmentnumber-- 15
Wiktor Stribiżew
  • Is there any specific reason to not call the `compile` inside the `for` loop? I am actually trying to use different `compile` statements based on the number of "variables" my string has. – Srivatsan Aug 29 '20 at 17:56
  • @Srivatsan You lose performance that way. You do not need `re.compile` if you need to build patterns from variables inside a loop, just use `re.search` directly then. – Wiktor Stribiżew Aug 29 '20 at 17:57
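
As a rough illustration of the comment above, the sketch below builds the pattern from a variable and passes it straight to `re.search`. The function find_apartment, the prefix parameter, and the "Main Road Apt 7" input are assumptions made up for this example; they do not come from the question. The `re` module keeps a cache of recently used patterns, so the module-level functions are usually adequate in this situation.

```python
import re

def find_apartment(text, prefix):
    """Return the apartment part of `text`, or "" if none is found.

    `prefix` is a hypothetical variable such as "No" or "Apt".
    re.escape protects against regex metacharacters inside it.
    """
    pattern = r'^\d+|\b' + re.escape(prefix) + r'\s*\d+'
    m = re.search(pattern, text)  # no explicit re.compile needed here
    return m.group() if m else ""

samples = [
    ("Street 45 No 100", "No"),
    ("Main Road Apt 7", "Apt"),       # invented input for illustration
    ("32 Syndicate street", "No"),
]
for text, prefix in samples:
    print(repr(find_apartment(text, prefix)))
```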
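  1. Whitespace inside a regex pattern is literal by default, so the spaces around | in your patterns have to be matched in the string. Use re.compile(..., re.X) if you want to use whitespace freely in the regex.
  2. print() inserts a space between its arguments by default.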
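
Both points can be checked with a short sketch. The patterns below are simplified versions of the apartment-number pattern from the question (the character class [\s?]+ is replaced by \s+ as an assumption, since whitespace handling inside classes differs under re.X); the variable names are made up for the example.

```python
import re

text = "Street 45 No 100"

# Without re.X, the spaces around "|" are literal and become part of the match.
literal = re.compile(r'(^\d+\s? | [A-Za-z]+\s+[0-9]+$)')
# With re.X (re.VERBOSE), unescaped whitespace in the pattern is ignored.
verbose = re.compile(r'(^\d+ | [A-Za-z]+\s+[0-9]+$)', re.X)

literal_match = literal.search(text).group()
verbose_match = verbose.search(text).group()
print(repr(literal_match))  # note the leading space in the match
print(repr(verbose_match))

# print() separates its arguments with sep, which defaults to a single space.
print("apartmentnumber--", verbose_match)
print("apartmentnumber--", verbose_match, sep="")
```

So the output in the question shows two spaces after "--": one added by print() and one that is part of the match itself.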
Gribouillis