3

Let's say we have the following R data.table (though a dataframe is fine in principle as well):

library(data.table)
new_dt = as.data.table(readcsv(...))
head(new_dt)
    name order_no  overlimit
1    Amy       32  TRUE
2   Jack       14  TRUE
3   Jack       16  TRUE
4   Dave       11  FALSE
5    Amy       22  TRUE
6   Jack       16  TRUE
7    Tom       19  FALSE
8  Larry       22  FALSE
9    Tom       89  FALSE
10  Dave       81  TRUE
11  Jack       67  TRUE
12   Tom       60  TRUE
13   Amy       23  FALSE
14  Jack       16  TRUE

For this task, the values of order_no is irrelevant---I would like to count the unique number of rows for name and overlimit:

name    overlimit  distinct_groups
Amy     TRUE       2
Amy     FALSE      1
Jack    TRUE       4
Jack    FALSE      0
Dave    TRUE       1
Dave    FALSE      1
...

Does one simply include more columns in the by argument?

ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234
  • 1
    If you want to edit to improve it, I'd say: make it so this can be run by others in a newly opened R console by copy-pasting to make the input; and make sure the desired output shown is complete and corresponds to the input. Anyway, does the linked question not cover it? – Frank Jul 06 '17 at 10:30
  • @Frank Yes, I am trying to delete this question. Thank you – ShanZhengYang Jul 06 '17 at 10:33
  • Hm, I think you should be able to delete it now (using a link at the bottom). I can undupe it if you think that would help. – Frank Jul 06 '17 at 10:35
  • @Frank There is a slight difference though. Using the solution provided at the URL, `setkey(dt,sex,fruit)[CJ(unique(sex), unique(fruit)), .N, by=.EACHI]`, this would delete any other column in `dt`. Is there a way to keep all of the columns? Naturally, you could use this output and merge with the original data.table, but there may be a better way. Am I being clear? – ShanZhengYang Jul 06 '17 at 10:40
  • No, I don't follow. I think your data is probably malformed (seeing repeated values like this). You might want to have a look at hadley's tidy data paper to see how to split it into a few tables https://www.jstatsoft.org/article/view/v059i10 – Frank Jul 06 '17 at 11:14

2 Answers2

5

To answer your question, yes, you can just add both columns to the by argument:

dt[, .(distinct_groups = uniqueN(order_no)), by = c("Name", "Overlimit")]
JAD
  • 2,035
  • 4
  • 21
  • 35
0

If the request is to have the adjacent elements in 'overlimit' as a separate group, then we can use rleid to create the group

dt[, .(overlimit = overlimit[1], distinct_group = uniqueN(order_no)),
     .(name, grp = rleid(overlimit))][, grp := NULL][]

If it is a simple grouping and wants only combinations

dt[, .(distinct_group = .N), by = .(name, overlimit)
  ][CJ(name = dt$name, overlimit = dt$overlimit, unique = TRUE), 
      on = .(name, overlimit)][is.na(distinct_group), distinct_group := 0][]
akrun
  • 874,273
  • 37
  • 540
  • 662