
I have a dataframe with an ID column that has multiple observations per unique ID. I also have a column with a binary value: "commute" or "cluster". How it usually goes is about a dozen or so observations of cluster, broken up by one or two observations of commute, then back to several clusters, and so on. What I want to do is create a separate column where each unique ID is split into several IDs based on the runs of clusters separated by commuting.

As in, say observations 1-15 are cluster, then 16-17 are commute, then 18-30 are cluster, and these all belong to ID "1". What I want is observations 1-15 labelled "1.1" and observations 18-30 labelled "1.2", etc. So each consecutive run of clusters is banded together under its own unique ID, the next run of clusters gets the next unique ID, and each unique ID is separated by one or more commute observations. I hope I've explained that well enough to communicate what I want.

group_by(track_ID) %>%
  mutate(foraging_event = ifelse(activity3 == "commuting", NA,
                                 cumsum(lag(activity3 == "commuting", default = TRUE)))) %>%
  mutate(foraging_event_tight = ifelse(activity2 != "commuting", NA,
                                       cumsum(lag(activity2 != "commuting", default = TRUE)))) %>%
  ungroup()

This is the code we tried, but it did not work. Does anyone have suggestions for what we might be able to do?

Matt-W22
  • I posted an answer below, but I think the main problem with your code is that you close the parentheses for `lag()` in the wrong place. It should be `lag(activity2) != "commuting"`, ... – DaveArmstrong Aug 11 '23 at 02:34
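To see the difference the comment is pointing at, here is a minimal sketch on a toy vector (not your actual data): wrapping the comparison inside `lag()` shifts the TRUE/FALSE results, whereas closing `lag()` around the column first shifts the strings and then compares.

```r
library(dplyr)

x <- c("cluster", "cluster", "commuting", "cluster")

# As written in the question: the comparison runs first, then lag()
# shifts the resulting logical vector (first element filled with TRUE).
lag(x == "commuting", default = TRUE)
#> [1]  TRUE FALSE FALSE  TRUE

# As suggested in the comment: lag() shifts the strings first,
# then the comparison runs on the lagged values.
lag(x) != "commuting"
#> [1]    NA  TRUE  TRUE FALSE
```

Note the `NA` in the second version: with no `default`, the first lagged value is `NA`, which then propagates through `cumsum()`, so a `default` is still needed there.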

2 Answers


It's hard to be sure something will work for your actual data when you haven't shared it, and your code references columns you haven't described, but something like the following should work.

Essentially, we just find every row where the previous value differs from the current one, take the cumulative sum of that (plus 1, so it isn't zero-indexed), then concatenate the result with the ID:

library(dplyr)

df <- data.frame(
    id = 1,
    observation = c(rep("cluster", 15), rep("commute", 2), rep("cluster", 13))
)

mutate(
    df,
    new_id = paste(id, cumsum(observation != lag(observation, default = first(observation))) + 1, sep = "."),
    .by = id
)

# Output
   id observation new_id
1   1     cluster    1.1
2   1     cluster    1.1
3   1     cluster    1.1
4   1     cluster    1.1
5   1     cluster    1.1
6   1     cluster    1.1
7   1     cluster    1.1
8   1     cluster    1.1
9   1     cluster    1.1
10  1     cluster    1.1
11  1     cluster    1.1
12  1     cluster    1.1
13  1     cluster    1.1
14  1     cluster    1.1
15  1     cluster    1.1
16  1     commute    1.2
17  1     commute    1.2
18  1     cluster    1.3
19  1     cluster    1.3
20  1     cluster    1.3
21  1     cluster    1.3
22  1     cluster    1.3
23  1     cluster    1.3
24  1     cluster    1.3
25  1     cluster    1.3
26  1     cluster    1.3
27  1     cluster    1.3
28  1     cluster    1.3
29  1     cluster    1.3
30  1     cluster    1.3
Mark

Here's an example along the same lines as your original code. The idea is to make a subgroup variable that is 1 on the first row, 1 wherever there is a change from commute to cluster, and 0 everywhere else. Then paste the cumulative sum of that subgroup variable onto track_ID:

library(dplyr)
dat <- data.frame(
  track_ID = rep(1:2, each = 30),
  activity2 = c(rep("cluster", 15), "commute", "commute", rep("cluster", 13),
                rep("cluster", 10), "commute", "commute", "commute", rep("cluster", 17))
)
dat <- dat %>% 
  group_by(track_ID) %>% 
  mutate(num = case_when(row_number() == 1 & activity2 == "cluster" ~ 1,
                         lag(activity2) == "commute" & activity2 == "cluster" ~ 1, 
                         TRUE ~ 0), 
         grp = paste(track_ID, cumsum(num), sep="."),
         grp = ifelse(activity2 == "commute", NA, grp))

table(dat$grp)
#> 
#> 1.1 1.2 2.1 2.2 
#>  15  13  10  17

Created on 2023-08-11 with reprex v2.0.2

DaveArmstrong