
So I have a DataFrame like this:

 No    fname        sname        landline        address
 1   Alphred      Thomas         123              A
 2   Peter        Jay            345              B
 3   Donald       Hook           123              A
 4   Jay          Donald         345              B
 5   Jay          Donald         123              A
 6   Haskell      Peter          123              B

Now I want to get all rows where landline and address are duplicated together. So in the case above, the group (123, A) would be one set of recurring entities and (345, B) would be another. I want to ignore (123, B) since it occurs only once.
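This first step, isolating the (landline, address) pairs that occur more than once, can be sketched with `duplicated(keep=False)`; the frame below is a reconstruction of the sample data above:

```python
import pandas as pd

# Reconstruction of the sample frame from the question
df = pd.DataFrame({
    'No': [1, 2, 3, 4, 5, 6],
    'fname': ['Alphred', 'Peter', 'Donald', 'Jay', 'Jay', 'Haskell'],
    'sname': ['Thomas', 'Jay', 'Hook', 'Donald', 'Donald', 'Peter'],
    'landline': [123, 345, 123, 345, 123, 123],
    'address': ['A', 'B', 'A', 'B', 'A', 'B'],
})

# keep=False marks every member of a duplicated (landline, address) group,
# so singletons like (123, B) drop out entirely
dups = df[df.duplicated(['landline', 'address'], keep=False)]
```

This keeps the five rows belonging to (123, A) and (345, B) and drops row 6.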

Now, for each of the duplicate groups, I want to check whether a single name occurs in both the fname and sname columns. So in the case of (123, A) we want to capture the lines where Donald appears in both fname and sname (they have to be two different rows, and the two columns have to contain the same name). So in the case above we would pick rows 3 and 5. After picking these, I want to perform one more operation: some sort of check on the date at which the name was entered.

How do I achieve this? I tried using duplicated, but that doesn't help much with the second comparison.

CodeGeek123

1 Answer


You can use groupby with isin to build a boolean mask, then filter with boolean indexing:

mask = df.groupby(['landline','address']).apply(
    lambda x: (x.fname.isin(x.sname) | x.sname.isin(x.fname)) & (len(x) > 1)
)
mask = mask.reset_index(level=['landline','address'], drop=True).sort_index()
print (mask)
0    False
1     True
2     True
3     True
4     True
5    False
dtype: bool

df1 = df[mask]
print (df1)
   No   fname   sname  landline address
1   2   Peter     Jay       345       B
2   3  Donald    Hook       123       A
3   4     Jay  Donald       345       B
4   5     Jay  Donald       123       A
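Further per-group operations can then be run by grouping the filtered rows on the same keys again; a minimal sketch, reconstructing `df1` as shown above:

```python
import pandas as pd

# df1: the duplicate rows kept by the mask above (reconstructed here)
df1 = pd.DataFrame({
    'No': [2, 3, 4, 5],
    'fname': ['Peter', 'Donald', 'Jay', 'Jay'],
    'sname': ['Jay', 'Hook', 'Donald', 'Donald'],
    'landline': [345, 123, 345, 123],
    'address': ['B', 'A', 'B', 'A'],
})

# Group again on the same keys to run any further checks per duplicate set
sizes = df1.groupby(['landline', 'address']).size()
```

Each duplicate set can now be inspected independently, e.g. via `sizes` or another `apply`.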

EDIT: I think you can use a custom function with filtering:

def f(x):
    # keep rows where a name appears in both columns, in groups with more than one row
    mask = (x.fname.isin(x.sname) | x.sname.isin(x.fname)) & (len(x) > 1)
    return x[mask]


df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
print (df2)
   No   fname   sname  landline address
0   3  Donald    Hook       123       A
1   5     Jay  Donald       123       A
2   2   Peter     Jay       345       B
3   4     Jay  Donald       345       B
jezrael
  • Thanks. The issue here is: how do I perform further operations on the unique groups of duplicates? Here I have received all the duplicates. Maybe I should group by again on landline and address? I want to check that the time of entry is < 1 day between two or more records within one group, the time of entry being another column. – CodeGeek123 Feb 27 '17 at 10:18
  • len(x) wouldn't necessarily give the time difference between the two entities? Do you know what gives the time diff? Basically I want to check whether the time difference between two records is less than 1 day. – CodeGeek123 Feb 28 '17 at 09:26
  • 1
    `len(x) > 1` is for filtering out unique rows. Question about difference of date is a bit complicated, because no dates in samples. I think the best is create new question with link to this question with sample and desired output. Is ist possible? Thanks. – jezrael Feb 28 '17 at 09:30
  • This is now done! Thank you!! Please find the link: http://stackoverflow.com/questions/42759117/detect-duplicates-in-certain-columns-in-a-dataframe-perform-operations-on-thes – CodeGeek123 Mar 13 '17 at 08:22
  • 1
    Yes, but it seems it is more complicated. But I try find solution. – jezrael Mar 13 '17 at 08:56
  • I added a solution; I hope it works nicely. Also, there were some weird negative timedeltas, so use [this solution](http://stackoverflow.com/a/8408546/2901002) and use seconds only. – jezrael Mar 13 '17 at 09:46
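The follow-up question in the comments, flagging records within a group entered less than one day apart, could be sketched along these lines. The sample data has no dates, so `entry_time` below is a hypothetical column name, and `abs()` guards against the negative timedeltas mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    'landline': [123, 123, 345, 345],
    'address': ['A', 'A', 'B', 'B'],
    # 'entry_time' is a hypothetical column; the question's sample has no dates
    'entry_time': pd.to_datetime(['2017-02-01 09:00', '2017-02-01 18:00',
                                  '2017-03-01 08:00', '2017-03-05 08:00']),
})

# Within each (landline, address) group, take the gap to the previous entry
# in time order; abs() guards against negative timedeltas
gap = (df.sort_values('entry_time')
         .groupby(['landline', 'address'])['entry_time']
         .diff())
within_day = gap.abs() < pd.Timedelta(days=1)
```

Here only the second (123, A) row is flagged: it follows the first by nine hours, while the (345, B) entries are four days apart.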