0

I have a pandas dataset like below:

import pandas as pd

data = {'id':  ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
        }

df = pd.DataFrame(data)

print(df)

I need to convert address column to text delimited by \n and create new columns like name, address line 1, City, State, Zipcode, Country like below:

id  Name   addressline1 City    State   Zipcode Country
1   William J. Clare    290 Valley Dr.  Casper  WY  82604   United States
2   null    1180 Shelard Tower  Minneapolis MN  55426   United States
3   William N. Barnard  145 S. Durbin   Casper  WY  82601   United States

I am learning python and from morning I am solving this. Any help will be greatly appreciated.

Thanks,

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Sushmitha
  • 111
  • 5
  • ```df = df.join(df['address'].str.split('\n', expand=True).add_prefix('new_column')).drop(['address'], axis=1)``` – Sushmitha Oct 18 '22 at 11:07
  • Tried the above command but still not able to read null if the name is not available. Schema is all over the place – Sushmitha Oct 18 '22 at 11:07
  • df = pd.DataFrame(data) Address = df['address'].str.split('\n',expand = True) print(Address) – Naina Oct 18 '22 at 11:15
  • Hi Naina, thank you for your response. But I am not able to extrapolate the right values into right columns. Thanks – Sushmitha Oct 18 '22 at 11:19
  • this is because the newline character \n is not marked at the right place in your original dataframe – Naina Oct 18 '22 at 11:20
  • for instance : William J. Clare\n290 Valley Dr.\nCasper – Naina Oct 18 '22 at 11:20
  • No, its beacause id no 2 has no name. if there is no name it should read the first column as Null or None – Sushmitha Oct 18 '22 at 11:23
  • There is pattern firstt name, addressline, city state, zipcode and country. If the name is not available it should be None. – Sushmitha Oct 18 '22 at 11:24

1 Answers1

0

Right now, Pandas is returning you the table with 2 columns. If you look at the value in the second column, the essential information is separated with the comma. Therefore, if you saved your dataframe to df you can do the following:

df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])

Now, you have additional three columns in your dataframe, the last one contains the full information about the country already. Now from the first two columns that you have created you can extract the info you need in a similar way.

df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])

Now you should have all the columns you need. You can remove the excess columns with:

df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)

Of course, it all can be done faster and with fewer lines of code, but I think it is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for split and join methods, and also for apply method, native to pandas.