0

I have a data frame of the following type:

date       ID1   ID2    sum
2017-1-5    1     a     200
2017-1-5    1     b     150
2017-1-5    2     a     300
2017-1-4    1     a     200
2017-1-4    1     b     120
2017-1-4    2     a     300
2017-1-3    1     b     150

I'm trying to compare between columns combinations over different dates to see if the sum values are equal. So, in the above-mentioned example, I'd like the code to identify that the sum of [ID1=1, ID2=b] combination is different between 2017-1-5 and 2017-1-4 (In my real data I have more than 2 ID categories and more than 2 Dates).

I'd like my output to be a data frame which contains all the combinations that include (at least one) unequal results. In my example:

date       ID1   ID2    sum
2017-1-5    1     b     150
2017-1-4    1     b     120
2017-1-3    1     b     150

I tried to solve it using loops like this: Is there a R function that applies a function to each pair of columns with no great success.

Your help will be appreciated.

Community
  • 1
  • 1
staove7
  • 560
  • 5
  • 18
  • What do you want your output to look like? If you have more than two dates for a combination of IDs, do you just want to indicate that they are all the same or not? Or, do you want to identify the unique ones, or what? – aichao Jan 12 '17 at 16:05
  • I want to **flag** the combination where the sum weren't equal; e.g., a data frame which will look like this: (rows; `date`) and (columns; `ID1`, `ID2`, `sum`)... so in my example: (1st row=`2017-1-5`) (2nd row=`2017-1-4`) and (1st col=`ID1`) (2nd col=`ID2`) (3rd col=`sum`)... and the 2 cells will show 150 and 120. Hope it was clear.. I might change my question.. – staove7 Jan 12 '17 at 16:14
  • What if you have three dates and two of them have the same sum, but the third does not? Which of the two do you want to flag in your output then? – aichao Jan 12 '17 at 16:17
  • I changed my question (twice :)). I hope it's more clear now. Thank you very much! – staove7 Jan 12 '17 at 16:21

1 Answers1

0

Using dplyr, we can group_by_(.dots=paste0("ID",1:2)) and then see if the values are unique:

library(dplyr)
res <- df %>% group_by_(.dots=paste0("ID",1:2)) %>% 
              mutate(flag=(length(unique(sum))==1)) %>%
              ungroup() %>% filter(flag==FALSE) %>% select(-flag)

The group_by_ allows you to group multiple ID columns easily. Just change 2 to however many ID columns (i.e., N) you have assuming that they are numbered consecutively from 1 to N. The column flag is created to indicate if all of the values are the same (i.e., number of unique values is 1). Then we filter for results for which flag==FALSE. This gives the desired result:

res
### A tibble: 3 x 4
##      date   ID1   ID2   sum
##     <chr> <int> <chr> <int>
##1 2017-1-5     1     b   150
##2 2017-1-4     1     b   120
##3 2017-1-3     1     b   150
aichao
  • 7,375
  • 3
  • 16
  • 18
  • Amazing. Thanks! One more thing.. Is there a way to `flag` only differences bigger than sum integer; e.g., bigger than 50 (which in my example would produce no results)? – staove7 Jan 12 '17 at 16:56
  • If you are interested in flagging a range of `sum` values that is greater than `50` in each group, then you can use `flag=(max(sum)-min(sum))<=50` instead of `flag=(length(unique(sum))==1)` inside of `mutate`. This will set those groups where the range of `sum` values is less than or equal to `50` to `TRUE` so that only those with groups with ranges that are greater than `50` will be kept. – aichao Jan 12 '17 at 17:05
  • Amazing again. Thank you very very much! – staove7 Jan 15 '17 at 07:55