0

I want to generate an id number within each group/subset of a dataframe Where each group is defined by two fields or more. In this test dataset I want to use "personid" and "date" as my category:

personid date measurement 
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23

I wish to add an id column with a value for each unique combination of the two column"personid" and "date", always starting with 1. This is my desired output:

personid date measurement id
1         x     23         1
1         x     32         1
2         y     21         1
3         x     23         1
3         z     23         2
3         y     23         3

This is a similar question to the 3 year old version Create a sequential number (counter) for rows within each group of a dataframe but after many attempts I wasn't able to extend their logic for my 2+fields category definition. Thanks!

Community
  • 1
  • 1
ecolog
  • 78
  • 7
  • Can you show what you attempted? The answers in the other questions should be pretty straightforward to adapt. – MrFlick Jul 29 '16 at 19:41
  • 1
    Something like: `library(data.table); DT[, id := rleid(date), by = personid]`? – Jaap Jul 29 '16 at 19:43

3 Answers3

3

Two possibilities with the data.table package:

library(data.table)
# option 1
setDT(df)[, id := frank(date, ties.method = 'dense'), by = personid][]
# option 2
setDT(df)[, id := rleid(date), by = personid]

which gives:

   personid date measurement id
1:        1    x          23  1
2:        1    x          32  1
3:        2    y          21  1
4:        3    x          23  1
5:        3    z          23  3
6:        3    y          23  2
Jaap
  • 81,064
  • 34
  • 182
  • 193
1

Here's one way to do it:

df <- data.frame(personid = c(1,1,2,3,3,3), 
                 date = c("x","x","y","x","z","y"), 
                 measurement = c(23,32,31,23,23,23))

#This should create a unique character string for each personid-date pair:
idChar <- paste(df$personid, df$date, sep = ".")

#unique() preserves the order of the first appearance of each pair,
#and match() tells the index of each pair in unique(idChar) for each idChar:
df$id <- match(idChar, unique(idChar))
  • Thanks for your response, when I tested your code, it produced a unique id for each pair, but didnt restart when a new personid was presented, as my desired output reflects. Most likely I wasn't too clear stating that in my question. – ecolog Jul 29 '16 at 20:21
0

Same idea as @Procrastinatus Maximus's rleid, here is a dplyr version of it:

library(dplyr)
df %>% 
      arrange(personid, date) %>% 
      group_by(personid) %>% 
      mutate(id = cumsum(date != lag(date, default = first(date))) + 1)
      # +1 converts the zero based id to one based id here

# Source: local data frame [6 x 4]
# Groups: personid [3]
#
#   personid   date measurement    id
#      <int> <fctr>       <int> <dbl>
# 1        1      x          23     1
# 2        1      x          32     1
# 3        2      y          21     1
# 4        3      x          23     1
# 5        3      y          23     2
# 6        3      z          23     3

In order for rleid or cumsum to work here, we have to sort the data frame by personid and then date since both methods only care about adjacent values.

Psidom
  • 209,562
  • 33
  • 339
  • 356
  • thanks! I see that your output is what I want, don't know why when I run the code I get this error: Error: expecting a single value – ecolog Jul 29 '16 at 20:06
  • What is the `class(df$date)`? The error means the data type returned from `cumsum` is not correct. – Psidom Jul 29 '16 at 20:11
  • Not quite sure what is happening. If it is a factor, it should work. – Psidom Jul 29 '16 at 20:16
  • Just to add another method. You can try `df %>% group_by(personid) %>% mutate(id = dense_rank(date))` as well. – Psidom Jul 29 '16 at 20:24
  • These second one worked with no problem. Don't know what was the issue with the first one. – ecolog Jul 29 '16 at 21:00