1

I have a dataframe with 6000 records and need to extract/split the column with streetname into: "Streetname", "Housingnumber" and "Adjectives". Unfortunately, the problem is not solved yet using regex functions because there is no structure in the notation of df["streetname"]:

**Input from df["Streetname"]**

St. edward's Lane 26

Vineyardlane3a

High Street 0-9

ParkRoad near #33

Queens Road ??

s-Georgelane9abc

Kings Road 9b

1st Park Avenue 67 near cyclelane 

**Output that I would like:

df["Street"]                    df["housingnumber"]             df["adjective"]**

St. Edward's lane               26

Vineyardlane                    3                               a

High Street                     0-9

ParkRoad                        33

Queens Road                    

s-Georgelane                    9                               abc

Kings Road                      9                               b 

1st Park Avenue                 67

I tried this:

Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'

df["Streetname"] = df["Streetname"].str.extract(Filter)

I lose a lot of data and the result is only written into one column... Hope that someone can help!

Rakesh
  • 81,458
  • 17
  • 76
  • 113
Levi
  • 11
  • 1

1 Answers1

0

Not 100% perfect (I doubt that this will be possible without a database or machine learning algorithms) but a starting point:

^                         # start of line/string
(?P<street>\w+?\D+)       # [a-zA-Z0-9_]+? followed by not a number
(?P<nr>\d*[-\d]*)         # a digit, followed by - and other digits, eventually
(?P<adjective>[a-zA-Z]*)  # a-z
.*                        # consume the rest of the string

See a demo on regex101.com.


You might want to strip of #, whitespaces or ? from the end of street afterwards.
Jan
  • 42,290
  • 8
  • 54
  • 79