-1

I'm new to python and i'm not sure where to start with wrangling my dataset, i have customer e-commerce sales data and need one of the columns to contain the county part of the address. The county is in most cases already in the Address4 column but some of the customers have filled in their county in Address1, Address2 or Address3 instead.

I have an array of all 32 counties so i think i need to check each column Address1, Address2,Address3,Address4 to see if they contain one of the counties in the array and if found, write the found county in Address4. Hope this makes sense.

enter image description here

in this case, i would need to find longford from Address3 and write over Co Longford and find Donegal in Address4 and write over Donegal in Address 4 and the same for all rows,

Peter Snee
  • 129
  • 1
  • 11
  • If you can be a bit clearer with your question and give a reproducible example and a desired output, i am pretty sure you'll get an answer in no time. – sophocles Jan 11 '21 at 19:38
  • I need to iterate through the entire dataframe and check if a string from my county Array exists in any of columns Address1, Address2, Address3 and Address4, and if it exists then write the string in Address4 – Peter Snee Jan 11 '21 at 19:49
  • I'll be home in 20 minutes and try to give an answer. – sophocles Jan 11 '21 at 20:14

1 Answers1

2

Say this is your df

df=pd.DataFrame({'Address1':[np.nan, 'jkl','pol','city1', np.nan],'Address2':['lop',np.nan,'pola',np.nan, np.nan],'Address3':[np.nan, np.nan,np.nan,'city13', np.nan],'Address4':[np.nan, np.nan,np.nan,np.nan,'shaka']})

    Address1 Address2 Address3 Address4
0      NaN      lop      NaN      NaN
1      jkl      NaN      NaN      NaN
2      pol     pola      NaN      NaN
3    city1      NaN   city13      NaN
4      NaN      NaN      NaN    shaka

and say this is the list of your addresses

listofAdress=['lop','jkl','pola','city13']

Create a temp column with the list of cities in each row excluding column addresses4

df['temp']=df.iloc[:,:3].values.tolist()

Convert map temp and listofAdress into sets and intersect and then drop temp

  df=df.assign(Address4=np.where(df.Address4.isna(),(df['temp'].map(set).apply(lambda x:x.intersection(set(listofAdress)))).map(list).str[0],df.Address4)).drop('temp',1)




      Address1 Address2 Address3 Address4
0      NaN      lop      NaN      lop
1      jkl      NaN      NaN      jkl
2      pol     pola      NaN     pola
3    city1      NaN   city13   city13
4      NaN      NaN      NaN    shaka
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • Thanks, I'm getting an error though. :1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy customerAddress['temp'] = customerAddress.iloc[:,:3].values.tolist() – Peter Snee Jan 11 '21 at 21:15
  • It is working for me. What version of pandas are you on? Anyway it shouldnt matter. If you have all your cities in the list, replace `df.iloc[:,:3].values.tolist()` with `df['temp']=df.values.tolist()` – wwnde Jan 11 '21 at 21:22
  • Hey soz, it was a warning, not an error, apparently a common one too. that worked for most of them, would there be a way i could check if the string in address1,2,3 and 4 contains a city from my list anywhere in the string, for example if a customer had put "Dublin 2" instead of "Dublin"? – Peter Snee Jan 11 '21 at 22:01