
I have a sample data frame that I am working with:

ID <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
TARG_AVG <- c(2.1,2.1,2.1,2.1,2.1,2.1,2.3,2.3,2.5,2.5,2.5,2.5,3.1,3.1,3.1,3.1,3.3,3.3,3.3,3.3,3.5,3.5)
Measurement <- c("Len","Len","Len","Wid","Ht","Ht","Dep","Brt","Ht","Ht","Dep","Dep"
                 ,"Dep","Dep","Len","Len","Ht","Ht","Brt","Brt","Wid","Wid")
df1 <- data.frame(ID,TARG_AVG,Measurement)

I am trying to solve 3 different problems here

1) I want to get a summary of how many measurements of each type there are for each (ID & TARG_AVG) grouping. I currently do this:

library(doBy)
unique <- summaryBy(Measurement ~ ID + TARG_AVG, data = df1, FUN = function(x) { c(Count = length(x)) })

This gives me the total (Measurement.Count), but I want the counts for each measurement too. My desired output is:

  ID TARG_AVG Len Wid Ht Dep Brt Measurement.Count
1  A      2.1   3   1  2   0   0                 6
2  A      2.3   0   0  0   1   1                 2
3  A      2.5   0   0  2   2   0                 4
4  B      3.1   2   0  0   2   0                 4
5  B      3.3   0   0  2   0   2                 4
6  B      3.5   0   2  0   0   0                 2

2) Once I get the above output, I would like to subset the rows so that I keep only those rows that have at least 2 measurements with a count of 2 or more. Here my desired output would be:

  ID TARG_AVG Len Wid Ht Dep Brt Measurement.Count
1  A      2.1   3   1  2   0   0                 6
3  A      2.5   0   0  2   2   0                 4
4  B      3.1   2   0  0   2   0                 4
5  B      3.3   0   0  2   0   2                 4

3) Finally, I would like to pivot the columns back into rows, keeping only the measurements with a count of 2 or more. My desired output here would be:

      ID TARG_AVG Measurement
    1  A      2.1   Len   
    2  A      2.1   Len   
    3  A      2.1   Len   
    4  A      2.1   Ht   
    5  A      2.1   Ht   
    6  A      2.5   Ht   
    7  A      2.5   Ht   
    8  A      2.5   Dep  
    9  A      2.5   Dep  
   10  B      3.1   Len  
   11  B      3.1   Len  
   12  B      3.1   Dep 
   13  B      3.1   Dep
   14  B      3.3   Ht 
   15  B      3.3   Ht 
   16  B      3.3   Brt 
   17  B      3.3   Brt 

I am learning the reshape2, dplyr & data.table packages at the moment, and it would be very helpful if someone could point me in the right direction to solve this.

Sharath

3 Answers


Newest solution

library(data.table) # v1.9.6+ (needed for uniqueN)
## count how many times each exact row occurs; keep rows occurring more than once,
## then keep only ID/TARG_AVG groups that still have more than one distinct Measurement
setDT(df1)[, indx := .N, by = names(df1)
           ][indx > 1, if (uniqueN(Measurement) > 1) .SD, by = .(ID, TARG_AVG)]
#     ID TARG_AVG Measurement indx
#  1:  A      2.1         Len    3
#  2:  A      2.1         Len    3
#  3:  A      2.1         Len    3
#  4:  A      2.1          Ht    2
#  5:  A      2.1          Ht    2
#  6:  A      2.5          Ht    2
#  7:  A      2.5          Ht    2
#  8:  A      2.5         Dep    2
#  9:  A      2.5         Dep    2
# 10:  B      3.1         Dep    2
# 11:  B      3.1         Dep    2
# 12:  B      3.1         Len    2
# 13:  B      3.1         Len    2
# 14:  B      3.3          Ht    2
# 15:  B      3.3          Ht    2
# 16:  B      3.3         Brt    2
# 17:  B      3.3         Brt    2
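
If you don't want the helper indx column in the final output, one small extension of the same chain (my addition, not part of the original answer) drops it at the end:

library(data.table) # v1.9.6+

setDT(df1)[, indx := .N, by = c("ID", "TARG_AVG", "Measurement")
           ][indx > 1, if (uniqueN(Measurement) > 1) .SD, by = .(ID, TARG_AVG)
           ][, indx := NULL][]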

Or the dplyr equivalent

library(dplyr)

df1 %>%
  group_by(ID, TARG_AVG, Measurement) %>%
  filter(n() > 1) %>%
  group_by(ID, TARG_AVG) %>%
  filter(n_distinct(Measurement) > 1)

Older solution

library(data.table)
## dcast the data (no need for the totals column here)
res <- dcast(df1, ID + TARG_AVG ~ Measurement)
## keep rows with at least 2 measurements that occur at least twice
res <- res[rowSums(res[-(1:2)] > 1) > 1, ]
## melt the data back and again keep only measurements that occur at least twice
res <- melt(setDT(res), id = 1:2)[value > 1]
## expand the data back to one row per occurrence
res[, .SD[rep(.I, value)]]
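
The last line works because, with no by, .I holds the row numbers of .SD, so rep(.I, value) repeats each row index value times. Here is a minimal standalone illustration of that idiom on toy data (not the question's data):

library(data.table)

toy <- data.table(Measurement = c("Len", "Ht"), value = c(3, 2))
toy[, .SD[rep(.I, value)]]
#    Measurement value
# 1:         Len     3
# 2:         Len     3
# 3:         Len     3
# 4:          Ht     2
# 5:          Ht     2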

The solution to the original question

Here's a possible solution using reshape2

1st step

library(reshape2)
res <- dcast(df1, ID + TARG_AVG  ~ Measurement, margins = "Measurement")

2nd step

res <- res[res$"(all)" > 2,]

3rd step

library(data.table)
setDT(df1)[, if(.N > 2) .SD, by = .(ID, TARG_AVG)]
David Arenburg
  • Thanks for the solution, but one thing I wanted to point out is that I am not really filtering the Total by (>2). I really want to filter the dataset based on measurements (i.e. only if 2 of those measurements are above 2 do I want to include the row). For example, if the total is 5 and the combination of measurements is 3+1+1, then I don't want to include the row since only one measurement is above 2. Could you please check that? – Sharath Oct 08 '15 at 00:30
  • My above df1 example might not be the best here since the filter can be applied to total itself rather than measurements. I am sorry for providing such an example. – Sharath Oct 08 '15 at 00:30
  • @Sharath I had `which(res[, c(3:7)] >= 2, arr.ind = TRUE) -> ind; res[unique(ind[,1]),] %>% arrange(ID, TARG_AVG) ` for the 2nd part. I wonder if this is what you meant. – jazzurro Oct 08 '15 at 00:34
  • If you only need the last output, why do you need all these steps before? You can reach it directly, like in the other answer – David Arenburg Oct 08 '15 at 00:37
  • jazzurro, thanks for the post, but the problem is it is not filtering based on 2 measurements. I need a filter that returns rows with at least 2 measurements that are >= 2. – Sharath Oct 08 '15 at 00:37
  • See my edit regarding the third step, does this work for you? You can change it to >= 2 in your real data set – David Arenburg Oct 08 '15 at 00:40
  • @David, Yes, you are right, but the problem is that it is filtering based on the total, I believe, because I see the 4th row (Wid) appearing in your solution. This is what I said earlier: I want to filter only based on measurements that are more than 2, which is what I don't know how to do. The total rows should be 17 and not 18. Could you please check that? Please let me know if I am not clear. – Sharath Oct 08 '15 at 00:43
  • David - Such a fantastic solution. This was so easy to understand. It works like a charm. I am going to apply it to a bigger dataset that I have and get back to you. – Sharath Oct 08 '15 at 01:06
  • I've added another solution, try it too. – David Arenburg Oct 08 '15 at 01:07
  • Wow David. It worked beautifully on my large dataset and was super fast too. This is some amazing stuff. Thank you so much for being patient and helping me out. – Sharath Oct 08 '15 at 01:24
  • See new update, I've improved the solution a bit further but you'll need the newest `data.table` version from CRAN (v 1.9.6) in order to use `uniqueN`. Also added a `dplyr` equivalent. – David Arenburg Oct 08 '15 at 06:13

You don't need tidyr in this case. You only need dplyr:

library(dplyr)

df2 <- df1 %>%
  group_by(ID, TARG_AVG) %>% # Group by ID and TARG_AVG
  mutate(count = n()) %>%    # Count how many rows there are for each combination of ID and TARG_AVG
  filter(count > 2) %>%      # Only keep the ones with more than 2 (I think you meant > 2)
  select(-count)             # Remove the auxiliary variable count
df2

A shorter (though less easily understandable) version is:

df2 <- df1 %>%
  group_by(ID, TARG_AVG) %>%
  filter(n() > 2)
df2

In this case I used the n() function directly instead of generating the auxiliary count variable.

EDIT: If you really want all three steps with dplyr and tidyr, you can do:

ID <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
TARG_AVG <- c(2.1,2.1,2.1,2.1,2.1,2.1,2.3,2.3,2.5,2.5,2.5,2.5,3.1,3.1,3.1,3.1,3.3,3.3,3.3,3.3,3.5,3.5)
Measurement <- c("Len","Len","Len","Wid","Ht","Ht","Dep","Brt","Ht","Ht","Dep","Dep"
                 ,"Dep","Dep","Len","Len","Ht","Ht","Brt","Brt","Wid","Wid")
df0 <- data.frame(ID,TARG_AVG,Measurement)

Steps 1 & 2. Summarise, count, filter by number of measurements and spread

library(dplyr)
library(tidyr)

df1 <- df0 %>%
  group_by(ID, TARG_AVG, Measurement) %>%
  summarise(count=n()) %>%
  group_by(ID, TARG_AVG) %>% # Step "2"
  filter(n() >= 2) %>%       # Step "2"
  spread(Measurement, count, fill = 0) %>% # Resume step "1"
  mutate(Measurement.count = Len + Wid + Ht + Dep + Brt)
df1

Step 3. Reshape again

df3 <- df1 %>%
  select(-Measurement.count) %>%
  gather(Measurement, dummy, Brt:Wid) %>%
  select(-dummy)
df3
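
The gather above keeps one row per ID/TARG_AVG/Measurement combination, including the zero counts. If you also want the count-expanded rows from step 3 of the question (each measurement repeated as many times as it occurs, with zeros and single occurrences dropped), here is a sketch along the same lines; it assumes the wide table from steps 1 & 2 is still stored in df1 as above, and it is my addition rather than part of the original answer:

library(dplyr)
library(tidyr)

df3_long <- df1 %>%
  ungroup() %>%                           # drop any leftover grouping from the summarise
  select(-Measurement.count) %>%
  gather(Measurement, count, Brt:Wid) %>% # keep the per-measurement counts this time
  filter(count >= 2)                      # drop zeros and single occurrences

## repeat each row according to its count, keeping only the three columns of interest
df3_expanded <- df3_long[rep(seq_len(nrow(df3_long)), df3_long$count),
                         c("ID", "TARG_AVG", "Measurement")]
df3_expanded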
Felipe Gerard
  • This solves the third problem directly. If you want the intermediate steps then you'll have to stick to what you have or replicate it with `dplyr` and `tidyr` ^^ – Felipe Gerard Oct 07 '15 at 23:32
  • But someone already did it u.u If you _really_ want a version using `dplyr` and `tidyr`, I can post it. `reshape` and `reshape2` are the past! – Felipe Gerard Oct 08 '15 at 00:44
  • Felipe, could you post a solution for dplyr and tidyr? And please make sure you are applying filters based on measurements and not the total. I want the filter to return rows that have at least 2 measurements >= 2, and not the total > 2. Please check that. – Sharath Oct 08 '15 at 00:46
  • I would then recommend doing the filter in step 1. The thing is that you would have to write explicitly which variables you want to check for measurements, as opposed to simply grouping in a different manner. – Felipe Gerard Oct 08 '15 at 01:04
  • Felipe - This works great with dplyr too. I will apply your solution to a bigger dataset and will let you know how that goes. – Sharath Oct 08 '15 at 01:07
  • :) just be sure to use the last update (the one with the 2 group_by statements), because that is the one that filters by number of variables with a measurement as opposed to the total. – Felipe Gerard Oct 08 '15 at 01:10
  • Felipe - I tried your solution on the bigger data set too after I ran david's script and I loved it :-) dplyr has some awesome functionality and I am hooked to it. Thanks for being patient and helping me out. – Sharath Oct 08 '15 at 01:26
  • My pleasure! Hadley's new packages are intensely fast, aren't they? ^^ – Felipe Gerard Oct 08 '15 at 01:44

Here's a data.table solution that may be a bit faster. I've found that subsetting in j with a by can be a bit slow compared to separating the task out into two steps: [1] Add extra columns you can use to filter on (do the by here), [2] perform the filter in one shot (without a by):

library(data.table) # v1.9.6+ for uniqueN
cTbl <- as.data.table(df1) # assuming cTbl holds the question's sample data

cTbl[, N := .N, .(ID, TARG_AVG, Measurement)                      # [1] add the filter columns
     ][N > 1, NMgt1 := uniqueN(Measurement) > 1, .(ID, TARG_AVG)  #     (do the by here)
     ][N > 1 & NMgt1                                              # [2] filter in one shot
     ][, c('N', 'NMgt1') := NULL                                  #     drop the helper columns
     ][]



    ID TARG_AVG Measurement
 1:  A      2.1         Len
 2:  A      2.1         Len
 3:  A      2.1         Len
 4:  A      2.1          Ht
 5:  A      2.1          Ht
 6:  A      2.5          Ht
 7:  A      2.5          Ht
 8:  A      2.5         Dep
 9:  A      2.5         Dep
10:  B      3.1         Dep
11:  B      3.1         Dep
12:  B      3.1         Len
13:  B      3.1         Len
14:  B      3.3          Ht
15:  B      3.3          Ht
16:  B      3.3         Brt
17:  B      3.3         Brt
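
If you want to check the speed claim on data of your own size, here is a rough benchmarking sketch (my addition, not part of the answer; it inflates the question's sample vectors and assumes the microbenchmark package is available):

library(data.table)     # v1.9.6+ for uniqueN
library(microbenchmark)

## build a larger copy of the sample data so the timing difference is visible
bigDT <- rbindlist(rep(list(data.table(ID, TARG_AVG, Measurement)), 1000))

microbenchmark(
  one_step = copy(bigDT)[, indx := .N, by = .(ID, TARG_AVG, Measurement)
                         ][indx > 1, if (uniqueN(Measurement) > 1) .SD, by = .(ID, TARG_AVG)],
  two_step = copy(bigDT)[, N := .N, .(ID, TARG_AVG, Measurement)
                         ][N > 1, NMgt1 := uniqueN(Measurement) > 1, .(ID, TARG_AVG)
                         ][N > 1 & NMgt1
                         ][, c('N', 'NMgt1') := NULL][],
  times = 10
)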
Clayton Stanley