For R data.table, how to use uniqueN() in order count unique/distinct values in multiple columns?

Question

Let's say we have the following R data.table (though a dataframe is fine in principle as well):

library(data.table)
new_dt = as.data.table(readcsv(...))
head(new_dt)
    name order_no  overlimit
1    Amy       32  TRUE
2   Jack       14  TRUE
3   Jack       16  TRUE
4   Dave       11  FALSE
5    Amy       22  TRUE
6   Jack       16  TRUE
7    Tom       19  FALSE
8  Larry       22  FALSE
9    Tom       89  FALSE
10  Dave       81  TRUE
11  Jack       67  TRUE
12   Tom       60  TRUE
13   Amy       23  FALSE
14  Jack       16  TRUE

For this task, the values of order_no is irrelevant---I would like to count the unique number of rows for name and overlimit:

name    overlimit  distinct_groups
Amy     TRUE       2
Amy     FALSE      1
Jack    TRUE       4
Jack    FALSE      0
Dave    TRUE       1
Dave    FALSE      1
...

Does one simply include more columns in the by argument?

If you want to edit to improve it, I'd say: make it so this can be run by others in a newly opened R console by copy-pasting to make the input; and make sure the desired output shown is complete and corresponds to the input. Anyway, does the linked question not cover it? — Frank, Jul 06 '17 at 10:30
Hm, I think you should be able to delete it now (using a link at the bottom). I can undupe it if you think that would help. — Frank, Jul 06 '17 at 10:35
@Frank There is a slight difference though. Using the solution provided at the URL, `setkey(dt,sex,fruit)[CJ(unique(sex), unique(fruit)), .N, by=.EACHI]`, this would delete any other column in `dt`. Is there a way to keep all of the columns? Naturally, you could use this output and merge with the original data.table, but there may be a better way. Am I being clear? — ShanZhengYang, Jul 06 '17 at 10:40
No, I don't follow. I think your data is probably malformed (seeing repeated values like this). You might want to have a look at hadley's tidy data paper to see how to split it into a few tables https://www.jstatsoft.org/article/view/v059i10 — Frank, Jul 06 '17 at 11:14

score 5 · Accepted Answer · answered Jul 06 '17 at 10:03

5

To answer your question, yes, you can just add both columns to the by argument:

dt[, .(distinct_groups = uniqueN(order_no)), by = c("Name", "Overlimit")]

answered Jul 06 '17 at 10:03

JAD

2,035
4
21
35

akrun · Answer 2 · 2017-07-06T10:33:29.633

0

If the request is to have the adjacent elements in 'overlimit' as a separate group, then we can use rleid to create the group

dt[, .(overlimit = overlimit[1], distinct_group = uniqueN(order_no)),
     .(name, grp = rleid(overlimit))][, grp := NULL][]

If it is a simple grouping and wants only combinations

dt[, .(distinct_group = .N), by = .(name, overlimit)
  ][CJ(name = dt$name, overlimit = dt$overlimit, unique = TRUE), 
      on = .(name, overlimit)][is.na(distinct_group), distinct_group := 0][]

edited Jul 06 '17 at 10:33

answered Jul 06 '17 at 09:58

akrun

874,273
37
540
662

OP's desired output has zeros in the distinct_groups column; does yours? – Frank Jul 06 '17 at 10:13
@Frank I think his input and output are not correct – akrun Jul 06 '17 at 10:14
@Frank Based on the input example, `Jack TRUE` count is only 3 – akrun Jul 06 '17 at 10:17
1

Yeah, OP also earned my downvote with a nonreproducible, incomplete and somewhat poorly worded example, fwiw. But I'll keep my downvote on these two answers as well, since y'all did not bother to clarify what the OP wanted before posting. In my opinion, it's very likely they are after the answers in the dupe target. – Frank Jul 06 '17 at 10:21
@Frank I'm happy to edit. – ShanZhengYang Jul 06 '17 at 10:24
@akrun I count `Jack TRUE` to be 4. Rows 2, 3, 11, 14 – ShanZhengYang Jul 06 '17 at 10:25
1

@ShanZhengYang Please go ahead and edit your post with a reproducible example and expected output that matches – akrun Jul 06 '17 at 10:27
@ShanZhengYang If you check there are two 16 values for Jack – akrun Jul 06 '17 at 10:30
@akrun I was being confusing. Forget `order_no`. I am simply trying to count the unique number of rows with `Jack` and `TRUE` – ShanZhengYang Jul 06 '17 at 10:31
@ShanZhengYang Now, I get 5 for Jack and TRUE, updated the post – akrun Jul 06 '17 at 10:33

For R data.table, how to use uniqueN() in order count unique/distinct values in multiple columns?

2 Answers2