0

I recently made this post about the same problem in Python, I'm interested to see what options may be available in R. I'm wondering if R has any packages that could made this a bit easier. How to tell if a value changed over dimension(s) in Pandas Let's say that I have some customer data over some dates and I want to see if for example their address has changed. Over those dates. Ideally, i'd like to copy the two columns where the changes occurred into a new table or just get a metric for the amount of total changes.

So, if I had a table like

Date , Customer , Address
12/31/14, Cust1, 12 Rocky Hill Rd
12/31/15, Cust1, 12 Rocky Hill Rd
12/31/16, Cust1, 14 Rocky Hill Rd
12/31/14, Cust2, 12 Testing Rd
12/31/15, Cust2, 12 Testing Ln
12/31/16, Cust2, 12 Testing Rd

I would end up with a count of two changes, Customer 1's change between12 Rocky Hill Rd between 12/31/15 and 12/31/16 and Cust2's change between 12/31/14 and 12/31/15.

Ideally I could get a table like this

Dates , Customer , Change
12/31/15 to 12/31/16, Cust1, 12 Rocky Hill Rd to 14 Rocky Hill Rd
12/31/14 to 12/31/15, Cust2, 12 Testing Rd to 12 Testing Ln

Or even just a total count of changes would be great. Any ideas? Ideally, i'd have any more dates, possibly multiple changes across those dates, and potentially additional columns i'd like to be checking for changes in as well. Really, just a summation of changes to a customer record over some date period for each column would suffice.

Community
  • 1
  • 1
sc305495
  • 249
  • 3
  • 11

2 Answers2

3

You can use group_by and summarise in dplyr to get the date range and count of changes as columns in a new table:

library(dplyr)
df %>%
  group_by(Customer) %>%
  summarise(dates = sprintf("%s to %s", min(Date), max(Date)),
            change.count = length(unique(Address)) - 1)

Result:

# A tibble: 2 × 3
  Customer                dates change.count
     <chr>                <chr>        <dbl>
1    Cust1 12/31/14 to 12/31/16            1
2    Cust2 12/31/14 to 12/31/16            1
ulfelder
  • 5,305
  • 1
  • 22
  • 40
  • Any thoughts on how we might expand this solution to account for multiple address changes at different dates? I think the proposed solution will provide a dates column of the min and max dates in the dataset (and unique addresses to account for any changes), but if I had say 20 rows each representing the address as of a specific date i'd like to know what the address was prior to a change and what it was changed to. This way I know The address was static for say 4 dates, then changed somewhere between say the 5th and 6th dates and so on... Any ideas? – sc305495 Mar 27 '17 at 14:32
1

For a total count of addresses you can just aggregate by the customer ID.

aggregate(Data$Address, by=list(Data$Customer), function(x) length(unique(x)))
  Group.1 x
1   Cust1 2
2   Cust2 2

This is the number of addresses. To get the number of changes subtract one.

G5W
  • 36,531
  • 10
  • 47
  • 80