2

I've been working with data initially exported to CSV and later imported from same CSV for further EDA. There's an address column which has 'suburb/locality' name attached to it. I was trying to find a way to split/extract those specific suburb name into a different column using excel. But i am not getting desired output. Would be helpful to know if I can do this using Python(NLTK) function?

Here is my sample data.

**Address column**
4a Mcarthurs Road, Altona north
1 Neal court, Altona North 
4 Vermilion Drive, Greenvale
Lot 307 Bonds Lane, Greenvale
430 Blackshaws rd, Altona North 
159 Bonds lane, Greenvale
Lot 1105 4 compass Drive Greenvale
6005 Bethany dr tarneet
Lot 655 Potofino Way Wollert
lot 403 Binds Lane, Greenvale
157 Maidstone street Altona
11 Laramie Street, Greenvale 
10 Preveli Way Wollert 
21 Laramie Street, Greenvale 
20 taipan crt tarneit
4 bisect road greenvale
83 everton road truganina
Lot 450 Vermilion Drive, Greenvale
Lot 641 Preveli Way Wollert 
648 hogans rd tarneit

Desired output:

Address                   Suburb
4a Mcarthurs Road        Altona North
1 Neal court              Altona North
4 Vermilion Drive          Greenvale
Lot 307 Bonds Lane         Greenvale
430 Blackshaws rd         Altona North
159 Bonds lane              Greenvale
Lot 1105 4 compass Drive    Greenvale
6005 Bethany dr              Tarneet
Lot 655 Potofino Way         Wollert
lot 403 Binds Lane          Greenvale
157 Maidstone street         Altona
11 Laramie Street          Greenvale
10 Preveli Way              Wollert
21 Laramie Street           Greenvale
20 taipan crt               Tarneit
4 bisect road              Greenvale
83 everton road            Truganina
Lot 450 Vermilion Drive    Greenvale
Lot 641 Preveli Way          Wollert
648 hogans rd               Tarneit

Any help on this will be highly appreciated.

Thank you in advance for the support!

adey27
  • 439
  • 3
  • 19

2 Answers2

2

You can try this:

df['local'] = df['Address column']\
                .str.extract(r'.+\, (.*)')\
                .fillna(df['Address column'].str.extract(r'.* (.*)$'))

print(df['local'])
0     Altona north
1     Altona North
2        Greenvale
3        Greenvale
4     Altona North
5        Greenvale
6        Greenvale
7          tarneet
8          Wollert
9        Greenvale
10          Altona
11       Greenvale
12         Wollert
13       Greenvale
14         tarneit
15       greenvale
16       truganina
17       Greenvale
18         Wollert
19         tarneit
Name: local, dtype: object
ashkangh
  • 1,594
  • 1
  • 6
  • 9
  • Thanks for your suggestion @ashkangh. but i am getting lot of NaNs in output. not sure why though. Also, original Address column still has suburb value. – adey27 Sep 18 '21 at 01:55
2

I noticed you can use the following regex pattern to split address column into address and suburb:

number word word split here word

(df["Address Column"]
.str.extract("(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)")
.apply(lambda x: x.str.title()))

Output:

                     Address        Suburb
0          4A Mcarthurs Road  Altona North
1               1 Neal Court  Altona North
2          4 Vermilion Drive     Greenvale
3         Lot 307 Bonds Lane     Greenvale
4          430 Blackshaws Rd  Altona North
5             159 Bonds Lane     Greenvale
6   Lot 1105 4 Compass Drive     Greenvale
7            6005 Bethany Dr       Tarneet
8       Lot 655 Potofino Way       Wollert
9         Lot 403 Binds Lane     Greenvale
10      157 Maidstone Street        Altona
11         11 Laramie Street     Greenvale
12            10 Preveli Way       Wollert
13         21 Laramie Street     Greenvale
14             20 Taipan Crt       Tarneit
15             4 Bisect Road     Greenvale
16           83 Everton Road     Truganina
17   Lot 450 Vermilion Drive     Greenvale
18       Lot 641 Preveli Way       Wollert
19             648 Hogans Rd       Tarneit

Note: I am sure this regex could be a lot neater.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Hi Scott, thanks for your suggestion. But i am getting this error "KeyError: "['Suburb'] not in index"". Am i missing anything? – adey27 Sep 18 '21 at 00:37
  • @adey27 What command are you using? 'Suburb' is just a label in this command, you should see a key error with this statement. After this statement you need to check your dataframe column headers. – Scott Boston Sep 18 '21 at 00:38
  • Ohk, you are correct. when I try to retrieve all column names using this command; list(df.columns.values.tolist()). "Suburb" is not appearing. how can I add it as a new column? – adey27 Sep 18 '21 at 00:42
  • Oh... you need to assign this command to the dataframe `df[['Address', 'Suburb']] = ` – Scott Boston Sep 18 '21 at 00:46
  • Tried this command : df[['Address', 'Suburb']] = (df["Address"] .str.extract(r"(?P
    .*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P.*$)") .apply(lambda x: x.str.title()))
    – adey27 Sep 18 '21 at 01:17
  • I am getting NaNs after running that command. – adey27 Sep 18 '21 at 01:19
  • `df.assign(**)` – Scott Boston Sep 18 '21 at 01:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/237227/discussion-between-adey27-and-scott-boston). – adey27 Sep 18 '21 at 01:43