1

Following on from this question, here is the input / output I want. I have some ideas though not entirely sure..

How do i detect duplicates and then among them cross check if two columns have similar values?

So I have a data frame like this.

 No  fname        sname        landline        address     time_of_move_in
 1   Alphred      Thomas         123              A        19/10/2016,00:01:00
 2   Peter        Jay            345              B        29/10/2016,00:01:00
 3   Donald       Hook           123              A        30/10/2016,00:11:00
 4   Jay          Donald         345              B        29/10/2016,00:05:00
 5   Jay          Donald         123              A        30/10/2016,00:14:00
 6   Haskell      Peter          123              B        19/10/2016,00:01:00

What i want is an output like this

 Case_Number   fname    sname    landline   address   time_diff
      1        Peter     Jay       345         B       -4 Hours
      1        Jay       Donald    345         B       4 Hours
      2        Donald    Hook      123         A       -2 Hours
      2        Jay       Donald    123         A       2 Hours

Eventually I only want to filter out any case where the time diff between the two discovered is < 3 hours.

Criteria Between any two cases detected

  1. Landline and Address should be identical

  2. If above is true then the same name must repeat in either the fname or surname between the two rows detected. (In case 1 above it is Jay and in Case 2 above it is Donald. Note if Donald repeats twice in fname then this is not a valid case)

  3. Time difference between the two <3 hours, i would like to bring directionality of time here as well eventually hence the negatives in the output set above.

Note : We do not have to display the time difference in the above format. So long as its some number/time format that is fine

Community
  • 1
  • 1
CodeGeek123
  • 4,341
  • 8
  • 50
  • 79

1 Answers1

1

You can convert timedelta to total_seconds because a bit complicated working with timedelta < 0:

df.time_of_move_in = pd.to_datetime(df.time_of_move_in, format='%d/%m/%Y,%H:%M:%S')
print (df)
   No    fname   sname  landline address     time_of_move_in
0   1  Alphred  Thomas       123       A 2016-10-19 00:01:00
1   2    Peter     Jay       345       B 2016-10-29 00:01:00
2   3   Donald    Hook       123       A 2016-10-30 00:11:00
3   4      Jay  Donald       345       B 2016-10-29 00:05:00
4   5      Jay  Donald       123       A 2016-10-30 00:14:00
5   6  Haskell   Peter       123       B 2016-10-19 00:01:00

def f(x):
    #convert 4 hours to seconds  
    hours4 = 4 * 60 * 60
    mask = x.fname.isin(x.sname) | x.sname.isin(x.fname) & (len(x) > 1)
    x1 = x[mask]
    #create unique values from x.name, insert as first column
    x1.insert(0,'Case_number', '{}{}'.format(*x.name))
    #get difference of datetimes, first value is NaN
    x1['time_diff'] = x1.time_of_move_in.diff().dt.total_seconds() 
    #get inverse difference, last value is NaN so filna NaN by value
    x1['time_diff']=x1['time_diff'].fillna(x1.time_of_move_in.diff(-1).dt.total_seconds())
    #boolean indexing
    x1 = x1[(x1['time_diff'] < hours4) & (x1['time_diff']  > -hours4)]
    return x1


df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
#factorize values, add 1 for start from 1
df2.Case_number = pd.factorize(df2.Case_number)[0] + 1
df2.drop(['time_of_move_in', 'No'], axis=1, inplace=True)
print (df2)
   Case_number   fname   sname  landline address  time_diff
0            1  Donald    Hook       123       A     -180.0
1            1     Jay  Donald       123       A      180.0
2            2   Peter     Jay       345       B     -240.0
3            2     Jay  Donald       345       B      240.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks. The only issue i also have is that it doesnt identify the pairs. In that it doesnt give you a common ID for both pairs? How do i do this? Since i want to present the pairs – CodeGeek123 Mar 13 '17 at 10:38
  • So `x1 = x[mask]` doesnt identify pairs? Maybe help me if you change data sample for identify problem. Is it possible? Thank you. – jezrael Mar 13 '17 at 11:20
  • Thank you.. I think that does. Just i want to put that on the output as well - as in the output dataset above. Maybe i can create another column to do that – CodeGeek123 Mar 13 '17 at 11:52