Consider this sample:

df <- data.frame(v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5), v1 = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'), v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))

For a large data frame: if any row has v0 > 4, I want to drop every row that shares that row's v1 value (that is, drop the whole group).

So here the result should be a data frame without any of the "a" rows, since group "a" contains v0 values of 5.

df_ExpectedResult <- data.frame(v0 = c(1, 2, 0, 1, 2, 2), v1 = c('b', 'b', 'c', 'c', 'b', 'b'), v2 = c(1, 8, 5, 10, 3, 3))

I would also like a second data frame that lists the dropped groups:

df_Dropped <- data.frame(v1 = 'a')

How would you do this efficiently for a huge dataset? I am currently using a for loop with an if statement, but the manipulation takes too long.

ie-con

3 Answers

A base R option using subset + ave

subset(df, !ave(v0 > 4, v1, FUN = any))

gives

  v0 v1 v2
4  1  b  1
5  2  b  8
6  0  c  5
7  1  c 10
8  2  b  3
9  2  b  3
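
The question also asks for a data frame of the dropped groups. A minimal sketch of one way to get both pieces from the same `ave` flag, assuming the sample `df` from the question (the names `flag`, `df_kept` and `df_dropped` are made up for illustration):

```r
df <- data.frame(
  v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
  v1 = c("a", "a", "a", "b", "b", "c", "c", "b", "b", "a", "a"),
  v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5)
)

# TRUE for every row whose v1 group contains at least one v0 > 4
flag <- ave(df$v0 > 4, df$v1, FUN = any)

# Rows belonging to groups that never exceed the threshold
df_kept <- df[!flag, ]

# One row per group that was dropped
df_dropped <- data.frame(v1 = unique(df$v1[flag]))
```

Computing the flag once and reusing it avoids scanning the groups twice, which may matter on a large data frame.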
ThomasIsCoding

An option with dplyr

library(dplyr)
df %>%
    group_by(v1) %>%
    # keep a group only if none of its v0 values exceeds 4
    filter(!any(v0 > 4)) %>%
    ungroup()

Output:

# A tibble: 6 x 3
#     v0 v1       v2
#  <dbl> <chr> <dbl>
#1     1 b         1
#2     2 b         8
#3     0 c         5
#4     1 c        10
#5     2 b         3
#6     2 b         3
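
To also capture the dropped groups that the question asks for, one possible approach is to store the per-group test in a helper column and split on it. This is only a sketch, assuming the sample `df` from the question; the column name `has_large` and the result names are hypothetical:

```r
library(dplyr)

df <- data.frame(
  v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
  v1 = c("a", "a", "a", "b", "b", "c", "c", "b", "b", "a", "a"),
  v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5)
)

# Mark every row of a group that contains at least one v0 > 4
flagged <- df %>%
  group_by(v1) %>%
  mutate(has_large = any(v0 > 4)) %>%
  ungroup()

# Rows from groups without any large value, helper column removed
df_kept <- flagged %>%
  filter(!has_large) %>%
  select(-has_large)

# One row per dropped group
df_dropped <- flagged %>%
  filter(has_large) %>%
  distinct(v1)
```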
akrun

It's two operations, but what about this:

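library(dplyr)
# data frame with one row per group that has some v0 > 4
drop_groups <- df %>% filter(v0 > 4) %>% select(v1) %>% unique()
# %in% needs a vector on the right, so compare against the v1 column
df_result <- df %>% filter(!(v1 %in% drop_groups$v1))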
df_result
#   v0 v1 v2
# 1  1  b  1
# 2  2  b  8
# 3  0  c  5
# 4  1  c 10
# 5  2  b  3
# 6  2  b  3
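
One thing to watch with this two-step approach: `%in%` expects a vector on its right-hand side. Passing a whole data frame happens to work when only one group is dropped, but as far as I can tell it stops matching once there are several. A small base R sketch with a made-up `groups_df`:

```r
# Hypothetical lookup table with two groups to drop
groups_df <- data.frame(v1 = c("a", "c"))

# Comparing against the data frame itself: the column is treated as a
# single list element, so the individual values are not matched
in_frame <- "a" %in% groups_df

# Comparing against the column vector matches value by value
in_column <- "a" %in% groups_df$v1
```

Here `in_frame` should come out `FALSE` and `in_column` `TRUE`, so extracting the column (with `$v1` or `dplyr::pull()`) is the safer habit.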
DaveArmstrong