
I have a data frame exemplified by the following

dist <- c(1.1, 1.0, 10.0, 5.0, 2.1, 12.2, 3.3, 3.4)
id <- rep("A", length(dist))
df <- cbind.data.frame(id, dist)

df

  id dist
1  A  1.1
2  A  1.0
3  A 10.0
4  A  5.0
5  A  2.1
6  A 12.2
7  A  3.3
8  A  3.4

I need to clean it up so that no value in the dist column is ever bigger than 2 times the value in the next row. A cleaned-up data frame would look like this:

  id dist
1  A  1.1
2  A  1.0
5  A  2.1
7  A  3.3
8  A  3.4

I have tried making a function with a for loop and an if statement to clean it:

cleaner <- function(df, dist, times_larger) {
  for (i in 1:(nrow(df) - 1)) {
    if (df$dist[i] > df$dist[i + 1] * times_larger) {
      df <- df[-i, ]
      break
    }
  }
  df
}

Obviously, if I don't break the loop it will throw an error, because the number of rows in df changes during the process. If I manually run the function on df several times:

df <- cleaner(df, "dist", 2)

it will clean up as I want.
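
What I do by hand corresponds roughly to the wrapper below, which just repeats the cleaner above until the row count stops changing, but it feels clunky:

# sketch: repeat cleaner() until no more rows are removed
clean_until_stable <- function(df, dist, times_larger) {
  repeat {
    n_before <- nrow(df)
    df <- cleaner(df, dist, times_larger)  # removes at most one offending row per call
    if (nrow(df) == n_before) break        # nothing was removed, so we are done
  }
  df
}

df <- clean_until_stable(df, "dist", 2)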

I have also tried different function constructions and applying them to the data frame with apply, but without any luck.

Does anyone have a good suggestion for how to repeat the function on the data frame until it no longer changes, a better function structure, or maybe a better way of cleaning?

Any suggestions are most appreciated

Kristian

3 Answers


You can shift your dist column one element left, multiply it by two, and compare with the original dist:

subset(df,dist < c(2*dist[-1],Inf))
#  id dist
#1  A  1.1
#2  A  1.0
#5  A  2.1
#7  A  3.3
#8  A  3.4
Marat Talipov

You could try `lead` from dplyr:

library(dplyr) #dplyr_0.4.0
filter(df, dist < 2 * lead(dist, default = Inf)) 
#    id dist
#1  A  1.1
#2  A  1.0
#3  A  2.1
#4  A  3.3
#5  A  3.4

Or use a similar method in data.table. A new function, shift, was introduced in the devel version of data.table; we can set type to 'lead' (by default it is 'lag' with fill = NA) and change fill to Inf (inspired by @Marat Talipov's post).

library(data.table) #data.table_1.9.5
setDT(df)[dist < 2 * shift(dist, type = 'lead', fill = Inf)]
#   id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Update

If a 'dist' value is exactly equal to 2 times the next value, the above solutions remove that row. In such cases:

setDT(df)[dist < 2 * (shift(dist, type = 'lead', fill = Inf) + .Machine$double.eps)]
#    id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Using a different example, as suggested by @Henrik in the comments:

df1 <- data.frame(dist= as.numeric(3:1))
setDT(df1)[dist < 2 * (shift(dist, type = 'lead', fill = Inf) + .Machine$double.eps)]
#    dist
#1:    3
#2:    2
#3:    1

Benchmarks

set.seed(49)
df <- data.frame(id='A', dist=rnorm(1e7,20))
df1 <- copy(df)
akrun1 <- function() {filter(df, dist < 2 * lead(dist, default = Inf))}
akrun2 <- function() {setDT(df1)[dist < 2 * shift(dist, type = 'lead', fill = Inf)]}
marat <- function() {subset(df, dist < c(2 * dist[-1], Inf))}
Colonel <- function() {df[with(df, dist < 2 * c(dist[-1], tail(dist, 1))), ]}

library(microbenchmark)
microbenchmark(akrun1(), akrun2(), marat(), Colonel(), 
                                unit='relative', times=20L)
#Unit: relative
#      expr      min       lq     mean   median       uq      max neval cld
#  akrun1() 2.029087 1.990739 1.864697 1.965247 1.773722 1.727474    20  b 
#  akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a  
#   marat() 8.032147 8.137982 7.359821 7.937062 7.134686 5.837623    20   d
# Colonel() 7.094465 7.045000 6.473552 6.903460 6.197737 5.359575    20  c 
Arun
akrun
  • @Henrik You are right. Thanks. All the current solutions do return only the 2 values. – akrun Jan 29 '15 at 18:12
    `filter(df, dist < 2 * lead(dist, default = Inf))` works for me – hadley Jan 29 '15 at 20:40
  • @hadley Thanks a lot, I used the dplyr version, since my data frame had several sub-groups it seemed natural. The only thing is that I have to filter several times: when values > 2*lead are filtered out, the resulting data frame may contain new rows whose value is more than 2 times the next value. This was not the case in my example though, so my bad. Thanks again for your quick and excellent replies. – Kristian Feb 04 '15 at 17:48
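
Following up on Kristian's comment, one way to repeat the dplyr filter within each sub-group until nothing more is removed could look roughly like this (just a sketch; it assumes the sub-groups are identified by the id column):

library(dplyr)

# sketch: re-apply the grouped filter until no further rows are dropped
clean_groups <- function(df) {
  repeat {
    n_before <- nrow(df)
    df <- df %>%
      group_by(id) %>%
      filter(dist < 2 * lead(dist, default = Inf)) %>%
      ungroup()
    if (nrow(df) == n_before) break
  }
  df
}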

A base R solution:

> df[with(df, dist<2*c(dist[-1], tail(dist,1))),]
  id dist
1  A  1.1
2  A  1.0
5  A  2.1
7  A  3.3
8  A  3.4

If there are no zero elements:

df[with(df, dist/c(dist[-1], tail(dist,1)))<2,]
Colonel Beauvel