R: How to identify unknown number of combinations?

Question

I have data in the following format:

Data <- data.frame(
  Names = c("Person A", "Person B","Person F", "Person G", "Person F", "Person G", "Person Q", "Person R"),
  Time_Stamp = c("2013-08-01 07:06:00", "2013-08-01 07:06:00", "2013-08-01 07:53:00", "2013-08-01 07:53:00", "2013-08-01 11:01:00", "2013-08-01 11:01:00", "2013-08-01 11:08:00", "2013-08-19 06:57:00")
)

#> Data
#      Names          Time_Stamp
# 1 Person A 2013-08-01 07:06:00
# 2 Person B 2013-08-01 07:06:00
# 3 Person F 2013-08-01 07:53:00
# 4 Person G 2013-08-01 07:53:00
# 5 Person F 2013-08-01 11:01:00
# 6 Person G 2013-08-01 11:01:00
# 7 Person Q 2013-08-01 11:08:00
# 8 Person R 2013-08-19 06:57:00

I would like to create a code that identifies when a combination (order doesn't matter) of people appear together with the same time stamp. So, for example, Person F and Person G appear together at the same time, 8:14 on 8/1/13, so they are a group and get a unique group name. If they show up again together, they still get the same name. The issue I have been having is that the real data is nearly 100,000 rows, and I do not know how many combinations of people I have in it that appear with the same time stamp, and combinations may have more than just 2 people.

I would like the new data to look like this:

Desired <- data.frame(
  Names = c("Person A", "Person B","Person F", "Person G", "Person F", "Person G", "Person Q", "Person R"),
  Time_Stamp = c("2013-08-01 07:06:00", "2013-08-01 07:06:00", "2013-08-01 07:53:00", "2013-08-01 07:53:00", "2013-08-01 11:01:00", "2013-08-01 11:01:00", "2013-08-01 11:08:00", "2013-08-19 06:57:00"),
  Group = c("Group 1", "Group 1", "Group 2", "Group 2", "Group 2", "Group 2", "No Group", "No Group")
)
#      Names          Time_Stamp    Group
# 1 Person A 2013-08-01 07:06:00  Group 1
# 2 Person B 2013-08-01 07:06:00  Group 1
# 3 Person F 2013-08-01 07:53:00  Group 2
# 4 Person G 2013-08-01 07:53:00  Group 2
# 5 Person F 2013-08-01 11:01:00  Group 2
# 6 Person G 2013-08-01 11:01:00  Group 2
# 7 Person Q 2013-08-01 11:08:00 No Group
# 8 Person R 2013-08-19 06:57:00 No Group

Yes, and then I would just drop them after. I'm only interested in people who appear in groups, not individually. — ellie9813, Apr 16 '20 at 21:08

Rui Barradas · Answer 1 · 2020-04-17T18:46:39.850

I believe the following function does what the question asks for.

The code works as follows:

Factors are coded internally as consecutive integers, so coerce to factor and then to integer to get unique integers per each unique "Time_Stamp" seen as a string.
Use ave to split that vector of integers by "Names", keeping just the first if f spans several names.
Finally, if each level of f has only one element, return "No Group" else paste "Group" before that level.

This function uses base R only but can be used in a dplyr::mutate instruction.

group_names <- function(x, col.name, col.date){
  f <- as.integer(as.factor(x[[col.date]]))
  f <- ave(f, x[[col.name]], FUN = function(x){
    if(length(x) > 1) x[1] else x
  })
  f <- ave(f, f, FUN = function(x){
    if(length(x) == 1) "No Group" else paste("Group", x)
  })
  f
}

Data$Group <- group_names(Data, "Names", "Time_Stamp")

Or, with dplyr. Both column numbers or column names work.

Data %>% mutate(Group = group_names(., 1, 2))
Data %>% mutate(Group = group_names(., "Names", "Time_Stamp"))
#     Names          Time_Stamp    Group
#1 Person A 2013-08-01 07:06:00  Group 1
#2 Person B 2013-08-01 07:06:00  Group 1
#3 Person F 2013-08-01 07:53:00  Group 2
#4 Person G 2013-08-01 07:53:00  Group 2
#5 Person F 2013-08-01 11:01:00  Group 2
#6 Person G 2013-08-01 11:01:00  Group 2
#7 Person Q 2013-08-01 11:08:00 No Group
#8 Person R 2013-08-19 06:57:00 No Group

Hello! Could you please explain the code in the first part of the answer? — ellie9813, Apr 17 '20 at 17:46
Thank you! I am now noticing on my real data something I didn't consider before. If there is a group of 5 people who all have the same time stamp on the same day, say it's (Person A, Person B, Person C, Person D, & Person E), and there are occurrences of smaller groups within that group of 5, say (Person A & Person B) have the same time stamp on a new day and (Person A & Person E) have the same time stamp on another day, they are still marked as Group 1, but I would like for each of those smaller groups to be marked as separate groups. Do you know how I would go about doing that? I am stumped. — ellie9813, Apr 17 '20 at 21:19

ThomasIsCoding · Answer 2 · 2020-04-17T09:03:00.137

3

Here is a solution using igraph

library(igraph)
u <- graph_from_data_frame(Data)
grp <- clusters(u)$membership[match(Data$Names,names(clusters(u)$membership))]
Desired <- within(Data, Group <- ave(grp,grp,FUN = function(x) {if (length(x)>1) paste("Group",x) else "No Group"}))

such that

> Desired
     Names          Time_Stamp    Group
1 Person A 2013-08-01 07:06:00  Group 1
2 Person B 2013-08-01 07:06:00  Group 1
3 Person F 2013-08-01 07:53:00  Group 2
4 Person G 2013-08-01 07:53:00  Group 2
5 Person F 2013-08-01 11:01:00  Group 2
6 Person G 2013-08-01 11:01:00  Group 2
7 Person Q 2013-08-01 11:08:00 No Group
8 Person R 2013-08-19 06:57:00 No Group

edited Apr 17 '20 at 09:03

answered Apr 16 '20 at 21:29

ThomasIsCoding

96,636
9
24
81

Thank you for the answer! So this one, I tested it on my real data, and it's close to what I need, but it just creates Group Names in numerical order based on the first occurrence of the group. So, if a group occurs multiple times, the next occurrence(s) of that group get a new name. The first answer though did what I needed it to do, so no worries. Thanks again! – ellie9813 Apr 16 '20 at 21:45
@ellie9813 Thanks for feedback. I updated my solution which used `igraph` for clustering and labeling. Maybe you can check it out. – ThomasIsCoding Apr 17 '20 at 09:04

chinsoon12 · Answer 3 · 2020-04-17T22:19:07.570

3

An option using data.table:

library(data.table)
setDT(Data, key=c("Time_Stamp","Names"))
Data[, g := if (.N > 1L) paste(Names, collapse=""), Time_Stamp]
Data[order(g), g := fifelse(is.na(g), NA_integer_, rleid(g))]

output:

      Names          Time_Stamp    g
1: Person A 2013-08-01 07:06:00    1
2: Person B 2013-08-01 07:06:00    1
3: Person F 2013-08-01 07:53:00    2
4: Person G 2013-08-01 07:53:00    2
5: Person F 2013-08-01 11:01:00    2
6: Person G 2013-08-01 11:01:00    2
7: Person Q 2013-08-01 11:08:00 <NA>
8: Person R 2013-08-19 06:57:00 <NA>

edited Apr 17 '20 at 22:19

answered Apr 17 '20 at 01:18

chinsoon12

25,005
4
25
35

Thank you for the answer! I tried this on my actual data, but for some reason nobody is getting assigned NA. Could you please explain the code and I can see if there's something I'm missing? – ellie9813 Apr 17 '20 at 17:09
can you try adding a record to the top of your data set, e.g. `data.table(Names="ajdijoeqnwfnwd", Time_Stamp="aksjdlakjds")`, if this is not NA then there is really a problem with the code. i also sorted the dataset as Names might not be sorted. – chinsoon12 Apr 17 '20 at 22:18

R: How to identify unknown number of combinations?

3 Answers3