1

Let's say that I have some customer data over some dates and I want to see if for example their address has changed. Over those dates. Ideally, i'd like to copy the two columns where the changes occurred into a new table or just get a metric for the amount of total changes.

So, if I had a table like

Date , Customer , Address
12/31/14, Cust1, 12 Rocky Hill Rd
12/31/15, Cust1, 12 Rocky Hill Rd
12/31/16, Cust1, 14 Rocky Hill Rd
12/31/14, Cust2, 12 Testing Rd
12/31/15, Cust2, 12 Testing Ln
12/31/16, Cust2, 12 Testing Rd

I would end up with a count of two changes, Customer 1's change between12 Rocky Hill Rd between 12/31/15 and 12/31/16 and Cust2's change between 12/31/14 and 12/31/15.

Ideally I could get a table like this

Dates , Customer , Change
12/31/15 to 12/31/16, Cust1, 12 Rocky Hill Rd to 14 Rocky Hill Rd
12/31/14 to 12/31/15, Cust2, 12 Testing Rd to 12 Testing Ln

Or even just a total count of changes would be great. Any ideas? Ideally, i'd have any more dates, possibly multiple changes across those dates, and potentially additional columns i'd like to be checking for changes in as well. Really, just a summation of changes to a customer record over some date period for each column would suffice.

I'm new to Panda's and not really sure exactly where to start on this.

Edit: As I note on the solution below, i'd like to be able to pass a larger dataframe with more than just an address to detect changes. In example, I've accomplished this in R with something like the following: `enter code here

`#How many changes have occured (unique values - 1)
UniLen <-  function(x){
  x <- length(unique(x))-1
  return(x)
}
#Create a vector of Address Features to check for changes in
Address_Features <- c("AddrLine1", "AddrLine2", "AddrLine3", "CityName", "State", "ZipCodeNum", "County")
#Check for changes in each address 'use this address for description' for each customer
AddressChanges_Detail <- mktData[,c("CustomerNumEID","AddressUniqueRelationDesc",Address_Features)] %>%
  group_by(CustomerNumEID, AddressUniqueRelationDesc) %>%
  summarise_each(funs(UniLen))

#Summarise results (how many changes for each feature)
AddressChanges_Summary <- AddressChanges_Detail[,Address_Features] %>%
  summarise_each(funs(sum))

This allows us to count how many changes occur, but i'm missing out on the date the change occurred and what the feature was changed from and to... It seems the Python solution you've proposed solves for that with the use of .shift instead of just a summary of unique values on some group. Ideally i'd like the best of both worlds :).

sc305495
  • 249
  • 3
  • 11

1 Answers1

1
df

Input dataframe

    Date    Customer    Address
0   12/31/14    Cust1   12 Rocky Hill Rd
1   12/31/15    Cust1   12 Rocky Hill Rd
2   12/31/16    Cust1   14 Rocky Hill Rd
3   12/31/14    Cust2   12 Testing Rd
4   12/31/15    Cust2   12 Testing Ln
5   12/31/16    Cust2   12 Testing Rd

Address change function:

def changeAdd(x):
    x=x[x.Address != x.shift(-1).Address]
    df1 = pd.DataFrame({'Date':x.shift(1).Date + ' to '+ x.Date,
              'Customer':x.Customer.max(),
              'Address':x.shift(1).Address +' to ' + x.Address})
    return df1[df1.Address.notnull()]


dm = df.groupby('Customer')\
   .apply(changeAdd)\
   .reset_index(drop=True)[['Date','Customer','Address']]

dm

Output dataframe:

Date    Customer    Address
0   12/31/15 to 12/31/16    Cust1   12 Rocky Hill Rd to 14 Rocky Hill Rd
1   12/31/14 to 12/31/15    Cust2   12 Testing Rd to 12 Testing Ln
2   12/31/15 to 12/31/16    Cust2   12 Testing Ln to 12 Testing Rd
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • That seems to have done the trick, thanks! I'm going to look into parametrizing the function a bit so that I could apply it to different groupings or columns more easily. I will post the result if I get to it. – sc305495 Mar 22 '17 at 21:10
  • Hi Scott, can you see my post edit and potentially provide some additional advice? I'm at a bit of a loss as to how to best 'parameterize' your solution. – sc305495 Mar 27 '17 at 18:42
  • Exactly what would you like to parameterize? The column/column(s) that are monitored for changing? What are the rules for a change? If both change or one or the other changes? etc... – Scott Boston Mar 27 '17 at 19:54
  • My intent is to generate metrics for each column independently (without writing a new function for each column). I'm looking at a bunch of customer related data (name, phone#, email, address, etc.) and i'd like to be able to pass a dataset where I'm grouping on say a customer # and seeing how many times that customers data is changing independently. In an ideal world, i'd be able to generate a dataset similar to what you've done for each feature where I know what the previous and new values are and what the previous and new date are. I've bee able to do this in R with uniqueValuesAndSummary – sc305495 Mar 27 '17 at 20:11
  • def changeAdd(x,incolumn): x=x[x[incolumn] != x.shift(-1)[incolumn]] df1 = pd.DataFrame({'Date':x.shift(1).Date + ' to '+ x.Date, 'Customer':x.Customer.max(), 'Address':x.shift(1)[incolumn] +' to ' + x[incolumn]}) return df1[df1[incolumn].notnull()] – Scott Boston Mar 27 '17 at 20:32
  • changecolumn = 'Address' dm = df.groupby('Customer')\ .apply(changeAdd,changecolumn)\ .reset_index(drop=True)[['Date','Customer','Address']] dm – Scott Boston Mar 27 '17 at 20:32
  • Maybe like that. You'll need to modify the logging 'Address' also. Good luck! – Scott Boston Mar 27 '17 at 20:33
  • That seems to have done it! Thank you Scott, this has been a great learning experience for me... I was struggling with some pretty simple concepts but I've got it down for the most part now. I ended up wraping the dm = ... in a for loop and then looping through an array of column names. Sort of like this: – sc305495 Mar 28 '17 at 15:53
  • AddrChangeFeatures = ['AddrLine1', 'AddrLine2']\ results = {}\ for i in range(len(AddrChangeFeatures)):\ changecol = AddrChangeFeatures[i]\ print("Processing: ", changecol)\ dm = AddrData.groupby(['CustomerNumEID', 'AddressUniqueRelationDesc']).apply(changeAdd, changecol).reset_index(drop=True)[['Date', 'CustomerNumEID', changecol]]\ print(dm)\ print("Finished Processing: ", changecol)\ # Store results in a list\ results[i] = dm – sc305495 Mar 28 '17 at 15:54
  • You're welcome. It also has been a good learning experience. – Scott Boston Mar 28 '17 at 15:58