Omit ID's which occur less than x times with a combination of vectors

Question

I have a large dataframe with speciesID's and Individual ID's.

For my dataset I need to remove the SpeciesID's when the occurrence of the unique combination of SpeciesID and IndID are less than 4 times.

For example I have the dataset:

SpeciesID   IndID
99          13-001
99          13-001
99          14-002
99          14-002
99          14-002
100         14-005
100         14-005
100         14-005
100         14-006
100         14-007
100         14-007
100         14-008
100         14-009
500         16-001
500         16-001
500         16-002
500         16-002
500         16-002
500         16-003
500         16-003
500         16-004
500         16-004
500         16-005
500         16-006
500         16-006
500         16-007

Seeing this dataset I want to remove the rows where the unique combination of SpeciesID and IndID occur less than 5 times:

In this case I want to remove:

Because the unique combination of:

99  13-001
99  14-002

only appears 2 times.

Please format your post so as to make it easily readable. – CertainPerformance Jul 25 '18 at 09:16 — CertainPerformance, Jul 25 '18 at 09:16
Thanks to Ronak Shan it was edited! Thank you – Suusie Jul 25 '18 at 09:23 — Suusie, Jul 25 '18 at 09:23

score 1 · Answer 1 · answered Jul 25 '18 at 09:35

Although the condition is not true for any of the combinations, please see this as a starting point.

  library(data.table)
  df <- structure(list(SpeciesID = c(99L, 99L, 99L, 99L, 99L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L, 500L)
 , IndID = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 13L, 13L, 14L), .Label = c("13-001", "14-002", "14-005", "14-006", "14-007", "14-008", "14-009", "16-001", "16-002", "16-003", "16-004", "16-005", "16-006", "16-007"), class = "factor")), class = "data.frame", row.names = c(NA, -26L))

  dt <- data.table(df)
  # Selecting all combinations which appear at least 3 times in the dataset
  dt[, .( SpeciesID
        , IndID
        , .N
)
, by = list(Unique.ID = paste0(SpeciesID, IndID))][N < 3,]

Ronak Shah · Accepted Answer · 2018-07-25T09:51:56.893

Using base R we calculate number of unique values per SpeciesID and select only those SpeciesID which occur greater than equal to 5 times.

df[ave(df$IndID, df$SpeciesID, FUN = function(x) length(unique(x))) >= 5, ]

#   SpeciesID  IndID
#6        100 14-005
#7        100 14-005
#8        100 14-005
#9        100 14-006
#10       100 14-007
#11       100 14-007
#12       100 14-008
#13       100 14-009
#14       500 16-001
#15       500 16-001
#16       500 16-002
#17       500 16-002
#18       500 16-002
#19       500 16-003
#20       500 16-003
#21       500 16-004
#22       500 16-004
#23       500 16-005
#24       500 16-006
#25       500 16-006
#26       500 16-007

length(unique(x)) can also be replaced by n_distinct from dplyr

library(dplyr)
df[ave(df$IndID, df$SpeciesID, FUN = n_distinct) >= 5, ]

Or a complete dplyr solution which is more verbose could be

library(dplyr)
df %>%
  group_by(SpeciesID) %>%
  filter(n_distinct(IndID) >= 5)

Thank you! That last one did the trick ;) – Suusie Jul 25 '18 at 10:22 — Suusie, Jul 25 '18 at 10:22

score 0 · Answer 3 · answered Jul 25 '18 at 09:35

You could use dplyr:

library(dplyr)

Group your data by SpeciesID and IndID, count how often the combination appears using row_number(), and filter groups where the maximum is above a certain threshold:

"SpeciesID   IndID
99          13-001
99          13-001
99          14-002
99          14-002
99          14-002
100         14-005
100         14-005
100         14-005
100         14-006
100         14-007
100         14-007
100         14-008
100         14-009
500         16-001
500         16-001
500         16-002
500         16-002
500         16-002
500         16-003
500         16-003
500         16-004
500         16-004
500         16-005
500         16-006
500         16-006
500         16-007" %>% 
  read.table(text = ., header = TRUE) %>% 
  group_by(SpeciesID, IndID) %>% 
  mutate(rn = row_number()) %>% 
  mutate(max = max(rn)) %>% 
  filter(max >= 3) %>% 
  select(SpeciesID, IndID)

The result (for threshold == 3):

# A tibble: 9 x 2
# Groups:   SpeciesID, IndID [3]
  SpeciesID IndID 
      <int> <fct> 
1        99 14-002
2        99 14-002
3        99 14-002
4       100 14-005
5       100 14-005
6       100 14-005
7       500 16-002
8       500 16-002
9       500 16-002

Omit ID's which occur less than x times with a combination of vectors

3 Answers3