0

I have a dataframe with different columns like :
1) cust mobile phone no
2) cust home phone
3) cust nextkin phone
4) cust fax
5) cust id

In my output dataframe I would like to have columns like :
1) cust id
2) cust phone 1
3) cust phone 2
4) cust phone 3
5) cust phone 4

The mapping between input and output phone numbers is as follows (but there is also a priority logic):

cust phone 1 = cust mobile phone no    
cust phone 2 = cust home phone    
cust phone 3 = cust nextkin phone    
cust phone 4 = cust fax 

Note that any of these could be blank in input dataframe. The priority logic says that if one is blank , then next available phone number should assigned to that phone column. So if cust phone 2 is blank but cust phone 3 is available, then cust phone 2 should be assigned the value and so on. Also, cust phone 1 to cust phone 4 should all be unique (no duplicates).

Since the dataframe is large, iterating through rows is not an option.

Here is a sample data frame:

df = pd.DataFrame({'cust mobile no': ['1', '2', '3'],
                  'cust home phone': [np.nan, '2', 'x'],
                  'cust nextkin phone': ['1', '2', 'g'],
                  'cust fax': [np.nan, '4', '5'],
                  'cust id': ['001', '002', '003']})

  cust mobile no cust home phone cust nextkin phone cust fax cust id
0              1             NaN                  1      NaN     001
1              2               2                  2        4     002
2              3               x                  g        5     003

Expected output :

  cust id cust phone 1 cust phone 2 cust phone 3 cust phone 4
0     001            1          NaN          NaN          NaN
1     002            2            4          NaN          NaN
2     003            3            x            g            5
PriyankaJ
  • 339
  • 4
  • 19

1 Answers1

0

First define a function that implements the logic you need using all four columns:

from itertools import zip_longest
input_keys = ["cust mobile no", "cust home phone", "cust nextkin phone", "cust fax"]
output_keys = [f"cust phone {n}" for n in range(1, 5)]

def assign_phone_nrs(row): 
    l = [row[k] for k in input_keys if row[k] != "nan"] # get columns != 'nan'
    l = list(dict.fromkeys(l).keys())  # remove duplicates, keep order 
    output_phone_nrs = dict(zip_longest(output_keys, l, fillvalue=np.nan))  # pad with nans & put into dict
    output_phone_nrs["cust id"] = row["cust id"]   # add original id
    return pd.Series(output_phone_nrs) 

Now apply that to your input dataframe:

>>> df.apply(assign_phone_nrs, axis=1)                                                                                                                                              
  cust phone 1 cust phone 2 cust phone 3 cust phone 4 cust id
0            1          NaN          NaN          NaN     001
1            2            4          NaN          NaN     002
2            3            x            g            5     003
arnaud
  • 3,293
  • 1
  • 10
  • 27