0

I have two Pandas DataFrame. I have one data frame containing 3 columns of interests, which contains IDs of customers for different products. I have a second data frame containing the names of the customers. I would like to extend the first data frame to have new columns containing the customer names as separate columns. For example, I would like to extend the main dataframe mapped with the second dataframe but with new columns like customer_1_name, customer_2_name, customer_3_name

Keeping in mind that one customer can have different customer ID packages.

In order to buttress my problem. I have enclosed a snippet of the two data frames

# DataFrame 1
dFrame_main = ({'Pageviews': [22.0,
  22.0,
  21.0,
  20.0,
  18.0,
  15.0,
  14.0,
  14.0,
  13.0,
  13.0,
  12.0,
  12.0,
  12.0,
  12.0,
  12.0,
  12.0,
  12.0,
  12.0,
  11.0,
  11.0],
 'Unique Pageviews': [8.0,
  8.0,
  16.0,
  14.0,
  14.0,
  12.0,
  13.0,
  12.0,
  13.0,
  13.0,
  8.0,
  8.0,
  5.0,
  11.0,
  12.0,
  7.0,
  9.0,
  9.0,
  5.0,
  5.0],
 'Avg. Time on Page': ['0:00:23',
  '0:00:23',
  '0:03:49',
  '0:00:31',
  '0:00:21',
  '0:00:27',
  '0:00:38',
  '0:00:15',
  '0:01:24',
  '0:00:20',
  '0:00:13',
  '0:00:13',
  '0:02:14',
  '0:00:33',
  '0:00:46',
  '0:00:14',
  '0:01:08',
  '0:01:08',
  '0:01:51',
  '0:01:51'],
 'CustomerID_1': ['465',
  '465',
  '162',
  '124',
  '920',
  '920',
  '920',
  '920',
  '920',
  '920',
  '165',
  '165',
  '166',
  '920',
  '920',
  '920',
  '162',
  '162',
  '1846',
  '118'],
 'CustomerID_2': ['702',
  '702',
  '446',
  '125',
  '470',
  '470',
  '470',
  '470',
  '470',
  '212',
  '1920',
  '1920',
  '868',
  '470',
  '470',
  '470',
  '873',
  '873',
  '862',
  '862'],
 'CustomerID_3': ['167',
  '167',
  '570',
  np.nan,
  '212',
  '212',
  '212',
  '212',
  '212',
  np.nan,
  '1670',
  '1670',
  '274',
  '212',
  '212',
  '212',
  '764',
  '764',
  '584',
  '584']})
# DataFrame 2
dFrame = pd.DataFrame({'CustomerID': [569,
  923,
  162,
  1798,
  920,
  470,
  1943,
  1798,
  162,
  124,
  1053,
  212,
  923,
  1747,
  1921,
  166,
  165,
  465,
  862,
  584],
 'CustomerNames': ['Thomas Bills',
  'Demi Boras',
  'Jerry wills',
  'Pills Wilson',
  'Jerry wills',
  'Harsh wilson',
  'Alli Pees',
  'Pills Wilson',
  'Jerry wills',
  'Pills Wilson',
  'Fedolls Feba',
  'Pills Wilson',
  'Demi Boras',
  'Harsh wilson',
  'Matt Lills',
  'Pills Wilson',
  'Twist Tells',
  'Jerry wills',
  'Matt Lills',
  'Balls tails']})

Please note: This is just a snippet of the large dataframe so that you can grasp the problem I am trying to solve.

I have tried to go this How to map one dataframe to another (python pandas)?, unfortunately, this does not help my situation.

Thanks for your time

JA-pythonista
  • 1,225
  • 1
  • 21
  • 44
  • I noticed your CustomerID in dFrame_main are objects, while CustomerID in dFrame is integers. I was thinking you could use [.join](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join) to join on CusomterID, and then specify the suffix. – Rick D. Mar 22 '20 at 00:37

1 Answers1

1
# Build a mapper series. For this to work we need to ensure that 
# - the datatypes are the same (string)
# - that we have no repeated values

mapper = dFrame.astype(str).drop_duplicates().set_index('CustomerID')['CustomerNames']

# Apply the mapping (the value of the column will be looked up in the index of the `mapper` series)
dFrame_main['CustomerID_1_name'] = dFrame_main['CustomerID_1'].map(mapper)
dFrame_main['CustomerID_2_name'] = dFrame_main['CustomerID_2'].map(mapper)
dFrame_main['CustomerID_3_name'] = dFrame_main['CustomerID_3'].map(mapper)
Gecko
  • 1,379
  • 11
  • 14
  • It works but I wonder why there are lots of missing values when indeed in the master sheet, all values are there – JA-pythonista Mar 22 '20 at 00:47
  • For the data you provided it does the right thing. In fact, there are a lot of NaNs, but that is because dFrame does not have all the customer ids in dFrame_main. You can check with something like all_customers = set(dFrame_main['CustomerID_1'].tolist() + dFrame_main['CustomerID_2'].tolist() + dFrame_main['CustomerID_3'].tolist()) all_customers - set(dFrame.CustomerID) A common mistake is that somewhere in your processing you are mixing up the *index* for each row with their position. For example if you use .iloc somewhere. – Gecko Mar 22 '20 at 00:56
  • After deep evaluation of the data, those missing values are as a result that some customers have the same surname or common names, and the 'drop_ducplicates affected that. How can I solve this issue? – JA-pythonista Mar 22 '20 at 13:15
  • Edited the response. By doing the drop_duplicates before setting the index we will not throw away rows that have the same name but different index. – Gecko Mar 22 '20 at 22:46
  • I get the error: ```InvalidIndexError: Reindexing only valid with uniquely valued Index objects``` – JA-pythonista Mar 23 '20 at 07:18