
I have a large DataFrame of customer records. I combined each customer's phone number and email address to create a unique ID key. Sometimes, though, there is a typo in the email, which produces two keys for the same customer.

Like so:

Key                          | Order #
555261andymiller@gmail.com   | 901345
555261andymller@gmail.com    | 901345

I'm thinking of grouping the keys by phone number (a partial string match) and then assigning every key in a group the value of the first key in that group. How would I go about doing this in pandas? I've tried iterating over the rows, and I've also tried groupby on the partial string, but I can't seem to assign new values that way.

  • Clean up the email address *before* creating the keys in the first place? – scnerd Apr 17 '18 at 21:49
  • I'm working with tens of thousands of rows and I'm not sure how to clean the email addresses other than manually. – DerivativeIntegral Apr 17 '18 at 21:52
  • How do you know which one is the typo email address and which one is the correct one? Is there any type of timestamp on it? What if the guy's email actually is `andymller@..`? – Joost Apr 17 '18 at 21:55
  • It really doesn't matter for my purposes which is correct. I just need the customer to have the same identifier. Whether it's andymiller or andymller is irrelevant. – DerivativeIntegral Apr 17 '18 at 21:57

1 Answer


If you really don't care what the new ID is, you can group by the first six characters of the key (which represent the phone number).

For example:

df.groupby(df.Key.str[:6]).first()

This returns a DataFrame indexed by the six-character phone prefix, where each row holds the first record for that customer. It assumes the phone number is always correct and always occupies the first six characters of the key, though it sounds like that should not be an issue.
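
If the goal is to keep every row and only overwrite the key, one option is `transform("first")`, which returns a result aligned to the original rows instead of one row per group. This is a minimal sketch, not a tested solution for your data: the sample frame and the `phone` variable are made up for illustration, and it assumes the phone number is exactly the first six characters of `Key`.

```python
import pandas as pd

# Hypothetical sample data modelled on the question.
df = pd.DataFrame({
    "Key": [
        "555261andymiller@gmail.com",
        "555261andymller@gmail.com",
        "555999janedoe@gmail.com",
    ],
    "Order #": [901345, 901345, 901350],
})

# Assumption: the first six characters of every key are the phone number.
phone = df["Key"].str[:6].rename("phone")

# transform("first") broadcasts each group's first key back onto every row
# of that group, so the result can be assigned straight to the column.
df["Key"] = df.groupby(phone)["Key"].transform("first")

print(df)
```

Both `555261...` rows end up with the same key, and no rows are dropped. "First" here means first in the current row order, so sort beforehand if you want a particular key to win.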

johnchase