r - Create a sequence number for each row within a category (defined by 2+ fields) in a dataframe

Question

I want to generate an id number within each group/subset of a dataframe Where each group is defined by two fields or more. In this test dataset I want to use "personid" and "date" as my category:

personid date measurement 
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23

I wish to add an id column with a value for each unique combination of the two column"personid" and "date", always starting with 1. This is my desired output:

personid date measurement id
1         x     23         1
1         x     32         1
2         y     21         1
3         x     23         1
3         z     23         2
3         y     23         3

This is a similar question to the 3 year old version Create a sequential number (counter) for rows within each group of a dataframe but after many attempts I wasn't able to extend their logic for my 2+fields category definition. Thanks!

Can you show what you attempted? The answers in the other questions should be pretty straightforward to adapt. — MrFlick, Jul 29 '16 at 19:41
Something like: `library(data.table); DT[, id := rleid(date), by = personid]`? — Jaap, Jul 29 '16 at 19:43

score 3 · Answer 1 · answered Jul 29 '16 at 20:14

Two possibilities with the data.table package:

library(data.table)
# option 1
setDT(df)[, id := frank(date, ties.method = 'dense'), by = personid][]
# option 2
setDT(df)[, id := rleid(date), by = personid]

which gives:

   personid date measurement id
1:        1    x          23  1
2:        1    x          32  1
3:        2    y          21  1
4:        3    x          23  1
5:        3    z          23  3
6:        3    y          23  2

score 1 · Answer 2 · answered Jul 29 '16 at 20:11

1

Here's one way to do it:

df <- data.frame(personid = c(1,1,2,3,3,3), 
                 date = c("x","x","y","x","z","y"), 
                 measurement = c(23,32,31,23,23,23))

#This should create a unique character string for each personid-date pair:
idChar <- paste(df$personid, df$date, sep = ".")

#unique() preserves the order of the first appearance of each pair,
#and match() tells the index of each pair in unique(idChar) for each idChar:
df$id <- match(idChar, unique(idChar))

answered Jul 29 '16 at 20:11

Vince FitzPatrick

76
1

Thanks for your response, when I tested your code, it produced a unique id for each pair, but didnt restart when a new personid was presented, as my desired output reflects. Most likely I wasn't too clear stating that in my question. – ecolog Jul 29 '16 at 20:21

score 0 · Accepted Answer · answered Jul 29 '16 at 19:59

0

Same idea as @Procrastinatus Maximus's rleid, here is a dplyr version of it:

library(dplyr)
df %>% 
      arrange(personid, date) %>% 
      group_by(personid) %>% 
      mutate(id = cumsum(date != lag(date, default = first(date))) + 1)
      # +1 converts the zero based id to one based id here

# Source: local data frame [6 x 4]
# Groups: personid [3]
#
#   personid   date measurement    id
#      <int> <fctr>       <int> <dbl>
# 1        1      x          23     1
# 2        1      x          32     1
# 3        2      y          21     1
# 4        3      x          23     1
# 5        3      y          23     2
# 6        3      z          23     3

In order for rleid or cumsum to work here, we have to sort the data frame by personid and then date since both methods only care about adjacent values.

answered Jul 29 '16 at 19:59

Psidom

209,562
33
339
356

thanks! I see that your output is what I want, don't know why when I run the code I get this error: Error: expecting a single value – ecolog Jul 29 '16 at 20:06
What is the `class(df$date)`? The error means the data type returned from `cumsum` is not correct. – Psidom Jul 29 '16 at 20:11
Not quite sure what is happening. If it is a factor, it should work. – Psidom Jul 29 '16 at 20:16
Just to add another method. You can try `df %>% group_by(personid) %>% mutate(id = dense_rank(date))` as well. – Psidom Jul 29 '16 at 20:24
These second one worked with no problem. Don't know what was the issue with the first one. – ecolog Jul 29 '16 at 21:00

r - Create a sequence number for each row within a category (defined by 2+ fields) in a dataframe

3 Answers3