3

Let's say I have a data frame

mydata <- data.frame(x = 1:25,
                     y = 26:50)

and another data frame with a set of min and max values

df.remove <- data.frame(min = c(3,10,22,17),
                        max = c(6,13,24,20))

I'm looking to delete every row of mydata whose value in column x falls between the min and max of any row in df.remove, giving me the output data frame

  x  y
  1 26
  2 27
  7 32
  8 33
  9 34
 14 39
 15 40
 16 41
 21 46
 25 50

I figured I could use the between() function to drop the values that fall within a range. Since I need to check the min and max values from each row of df.remove, I tried running a loop with this code

result <- data.frame()
for(i in 1:nrow(df.remove)) {
  result <- mydata[!between(mydata$x,df.remove$min[i],df.remove$max[i]),]
}

This, for obvious reasons, returns the output with only the last set of min and max values removed. To get the output I am looking for, I would presumably have to run each iteration on the output of the previous one instead of on the original data frame mydata, but I couldn't find a way to do that.

Orvill

6 Answers

2

What you are looking for is known as a non-equi anti-join. This can be done pretty easily with the data.table package. Consider

library(data.table)

mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3,10,22,17), max = c(6,13,24,20))
setDT(mydata)[!df.remove, on = .(x >= min, x <= max)] # drop rows where min <= x <= max

Output

     x  y
 1:  1 26
 2:  2 27
 3:  7 32
 4:  8 33
 5:  9 34
 6: 14 39
 7: 15 40
 8: 16 41
 9: 21 46
10: 25 50
ekoam
1

In your code, result only ever reflects the last range, because every iteration filters the original mydata and overwrites result with that single filtered copy.

Instead, filter the data frame produced by the previous iteration. You could try the following code.

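library(dplyr)  # for between(); data.table::between() also works

result <- mydata
for (i in seq_len(nrow(df.remove))) {
  # filter the running result, not the original mydata
  result <- result[!between(result$x, df.remove$min[i], df.remove$max[i]), ]
}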

Because result starts out as a copy of mydata and each iteration filters result itself, the removals accumulate across iterations.
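
If you would rather not subset the data frame on every pass, the same accumulation can be done on a logical vector and applied once at the end. This is only a sketch in base R: the name drop is made up for illustration, and the plain >= and <= comparisons stand in for between() on the assumption that both bounds are inclusive.

```r
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3, 10, 22, 17), max = c(6, 13, 24, 20))

# Start with "drop nothing" and switch rows on as ranges match
drop <- rep(FALSE, nrow(mydata))
for (i in seq_len(nrow(df.remove))) {
  # Inclusive bounds, mirroring what between() would test
  in_range_i <- mydata$x >= df.remove$min[i] & mydata$x <= df.remove$max[i]
  drop <- drop | in_range_i
}

# Subset once, using the accumulated mask
result <- mydata[!drop, ]
result
```

The order of the rows in df.remove does not matter here, since the ranges are simply OR-ed together.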

1

A base R approach -

res <- subset(mydata, !x %in% unlist(Map(`:`, df.remove$min, df.remove$max)))
res

#    x  y
#1   1 26
#2   2 27
#7   7 32
#8   8 33
#9   9 34
#14 14 39
#15 15 40
#16 16 41
#21 21 46
#25 25 50

Using Map, we create the sequence from min to max for each row of df.remove, unlist the sequences into a single vector, and drop every row whose x appears in that vector.
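
Note that matching against min:max sequences only catches whole-number values of x, because `:` steps by 1. If x could contain fractions, one possible alternative is to compare against the bounds directly. The sketch below uses outer() for that; mydata2, hit and res2 are hypothetical names and the data is invented for illustration.

```r
# Hypothetical data with non-integer x values (not from the question)
mydata2 <- data.frame(x = c(2.5, 3.5, 6.5, 10))
df.remove <- data.frame(min = c(3, 10, 22, 17), max = c(6, 13, 24, 20))

# One row per x value, one column per range; TRUE where min <= x <= max
hit <- outer(mydata2$x, df.remove$min, `>=`) &
  outer(mydata2$x, df.remove$max, `<=`)

# Keep the rows that fall inside none of the ranges
res2 <- mydata2[rowSums(hit) == 0, , drop = FALSE]
res2
```

Here 3.5 is dropped because it lies in the range 3 to 6, whereas a lookup against the sequence 3:6 would have kept it.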


Another option, using the fuzzyjoin package -

fuzzyjoin::fuzzy_anti_join(mydata, df.remove, 
                           c('x' = 'min', 'x' = 'max'), 
                           match_fun = c(`>=`, `<=`))
Ronak Shah
0

Since you're already using dplyr's between(), you can combine it with dplyr's filter(). The idea is to test every value of x in mydata against every row of df.remove. mapply() does this because between() needs two inputs per range (the lower and the upper bound): it returns a logical matrix with one row per row of mydata and one column per row of df.remove. apply(., 1, any) then collapses each row to a single TRUE if x falls in at least one range. Negating that result keeps only the rows that fall in none of the ranges:

library(dplyr)
mydata %>% 
  filter(
    !mapply(function(left, right) between(mydata$x, left, right), left = df.remove$min, right = df.remove$max) %>% 
      apply(., 1, any)
    )

Returns:

    x  y
1   1 26
2   2 27
3   7 32
4   8 33
5   9 34
6  14 39
7  15 40
8  16 41
9  21 46
10 25 50
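
To make the intermediate matrix easier to inspect, the steps can be pulled apart as in the sketch below. It swaps between() for plain base R comparisons so that it runs without any packages (an assumption of this sketch, not part of the answer above); in_range and in_any are illustrative names.

```r
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3, 10, 22, 17), max = c(6, 13, 24, 20))

# One column per range: is each x inside [left, right]?
in_range <- mapply(function(left, right) mydata$x >= left & mydata$x <= right,
                   left = df.remove$min, right = df.remove$max)

dim(in_range)  # rows correspond to mydata, columns to df.remove

# TRUE for every row of mydata that falls in at least one range
in_any <- apply(in_range, 1, any)

mydata[!in_any, ]
```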
Brian Syzdek
0

Just because this is an interesting problem with several possible solutions, here is another approach using metaprogramming.

The idea is to turn df.remove into a list of expressions and then use that list inside filter(mydata, !!! .), splicing it in with the !!! operator.

One way to get the list of expressions is to combine rowwise() with summarise() and build each expression with bquote(), which evaluates anything wrapped in .() and leaves the rest unevaluated. In our case, the parts wrapped in .() are the min and max values.

Although this works, I'd probably use either @ekoam's {data.table} approach or @Ronak's base R approach.

library(dplyr)

df.remove %>% 
  rowwise %>% 
  summarise(x = list(bquote(!x %in% c(.(min):.(max))))) %>%
  pull(x) %>% 
  filter(mydata, !!! .)

#> `summarise()` has ungrouped output. You can override using the `.groups`
#> argument.
#>     x  y
#> 1   1 26
#> 2   2 27
#> 3   7 32
#> 4   8 33
#> 5   9 34
#> 6  14 39
#> 7  15 40
#> 8  16 41
#> 9  21 46
#> 10 25 50

Created on 2022-01-23 by the reprex package (v0.3.0)
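
To see what the bquote() step produces for a single row, it can help to build one expression by hand and evaluate it. This is only an illustrative sketch in base R; lo, hi, expr and keep are hypothetical names, and the bounds are taken from the first row of df.remove.

```r
mydata <- data.frame(x = 1:25, y = 26:50)

# Hypothetical bounds for a single row of df.remove
lo <- 3
hi <- 6

# .() is evaluated immediately; everything else stays unevaluated
expr <- bquote(!x %in% c(.(lo):.(hi)))
print(expr)

# Evaluating the expression with mydata's columns in scope gives a keep/drop mask
keep <- eval(expr, mydata)
mydata[keep, ]
```

filter() does the equivalent of that evaluation for every spliced expression and keeps the rows for which all of them are TRUE.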

TimTeaFan
0

Using %inrange% from data.table.

library(data.table)
mydata[!mydata$x %inrange% df.remove, ]
#     x  y
# 1   1 26
# 2   2 27
# 7   7 32
# 8   8 33
# 9   9 34
# 14 14 39
# 15 15 40
# 16 16 41
# 21 21 46
# 25 25 50
jay.sf