
I have a data frame exemplified by the following

dist <- c(1.1, 1.0, 10.0, 5.0, 2.1, 12.2, 3.3, 3.4)
id <- rep("A", length(dist))
df <- cbind.data.frame(id, dist)

df

  id dist
1  A  1.1
2  A  1.0
3  A 10.0
4  A  5.0
5  A  2.1
6  A 12.2
7  A  3.3
8  A  3.4

I need to clean it up so that no value in the dist column is ever bigger than 2 times the value in the next row. A cleaned-up data frame would look like this:

  id dist
1  A  1.1
2  A  1.0
5  A  2.1
7  A  3.3
8  A  3.4

I have tried making a function with a for loop and an if statement to clean it:

cleaner <- function(df, dist, times_larger) {
  for (i in 1:(nrow(df) - 1)) {
    if (df$dist[i] > df$dist[i + 1] * times_larger) {
      df <- df[-i, ]
      break
    }
  }
  df
}

Obviously, if I don't break the loop it will throw an error, because the number of rows in df changes during the process. If I manually run the function on df several times:

df <- cleaner(df, "dist", 2)

it will clean up as I want.
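
What I do by hand corresponds roughly to the wrapper below, which just repeats the cleaner above until the row count stops changing, but it feels clunky:

# sketch: repeat cleaner() until no more rows are removed
clean_until_stable <- function(df, dist, times_larger) {
  repeat {
    n_before <- nrow(df)
    df <- cleaner(df, dist, times_larger)  # removes at most one offending row per call
    if (nrow(df) == n_before) break        # nothing was removed, so we are done
  }
  df
}

df <- clean_until_stable(df, "dist", 2)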

I have also tried different function constructions and applying them to the data frame with apply, but without any luck.

Does anyone have a good suggestion for how to repeat the function on the data frame until it no longer changes, a better function structure, or maybe a better way of cleaning?

Any suggestions are most appreciated

Kristian

3 Answers


You can shift your dist column one element left, multiply it by two, and compare with the original dist:

subset(df,dist < c(2*dist[-1],Inf))
#  id dist
#1  A  1.1
#2  A  1.0
#5  A  2.1
#7  A  3.3
#8  A  3.4
Marat Talipov

You could try `lead` from dplyr:

library(dplyr) #dplyr_0.4.0
filter(df, dist < 2 * lead(dist, default = Inf)) 
#    id dist
#1  A  1.1
#2  A  1.0
#3  A  2.1
#4  A  3.3
#5  A  3.4

Or use a similar method in data.table. A new function, shift, was introduced in the devel version of data.table; we can set type to 'lead' (by default it is 'lag' with fill = NA) and change fill to Inf (inspired by @Marat Talipov's post).

library(data.table) #data.table_1.9.5
setDT(df)[dist < 2 * shift(dist, type = 'lead', fill = Inf)]
#   id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Update

If a 'dist' value is exactly equal to 2 times the next value, the above solutions remove that row. In such cases:

setDT(df)[dist < 2 * (shift(dist, type = 'lead', fill = Inf) + .Machine$double.eps)]
#    id dist
#1:  A  1.1
#2:  A  1.0
#3:  A  2.1
#4:  A  3.3
#5:  A  3.4

Using a different example, as suggested by @Henrik in the comments:

df1 <- data.frame(dist= as.numeric(3:1))
setDT(df1)[dist < 2 * (shift(dist, type = 'lead', fill = Inf) + .Machine$double.eps)]
#    dist
#1:    3
#2:    2
#3:    1

Benchmarks

set.seed(49)
df <- data.frame(id='A', dist=rnorm(1e7,20))
df1 <- copy(df)
akrun1 <- function() {filter(df, dist < 2 * lead(dist, default = Inf))}
akrun2 <- function() {setDT(df1)[dist < 2 * shift(dist, type = 'lead', fill = Inf)]}
marat <- function() {subset(df, dist < c(2 * dist[-1], Inf))}
Colonel <- function() {df[with(df, dist < 2 * c(dist[-1], tail(dist, 1))), ]}

library(microbenchmark)
microbenchmark(akrun1(), akrun2(), marat(), Colonel(), 
                                unit='relative', times=20L)
#Unit: relative
#      expr      min       lq     mean   median       uq      max neval cld
#  akrun1() 2.029087 1.990739 1.864697 1.965247 1.773722 1.727474    20  b 
#  akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a  
#   marat() 8.032147 8.137982 7.359821 7.937062 7.134686 5.837623    20   d
# Colonel() 7.094465 7.045000 6.473552 6.903460 6.197737 5.359575    20  c 
Arun
akrun
  • @Henrik You are right. Thanks. All the current solutions do return only the 2 values. – akrun Jan 29 '15 at 18:12
    `filter(df, dist < 2 * lead(dist, default = Inf))` works for me – hadley Jan 29 '15 at 20:40
  • @hadley Thanks a lot, I used the dplyr version, since my data frame had several sub-groups it seemed natural. The only thing is that I have to filter several times: when values > 2*lead are filtered out, the resulting data frame may contain new rows whose value is more than 2 times the next value. This was not the case in my example though, so my bad. Thanks again for your quick and excellent replies. – Kristian Feb 04 '15 at 17:48
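
Following up on Kristian's comment, one way to repeat the dplyr filter within each sub-group until nothing more is removed could look roughly like this (just a sketch; it assumes the sub-groups are identified by the id column):

library(dplyr)

# sketch: re-apply the grouped filter until no further rows are dropped
clean_groups <- function(df) {
  repeat {
    n_before <- nrow(df)
    df <- df %>%
      group_by(id) %>%
      filter(dist < 2 * lead(dist, default = Inf)) %>%
      ungroup()
    if (nrow(df) == n_before) break
  }
  df
}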

A base R solution:

> df[with(df, dist<2*c(dist[-1], tail(dist,1))),]
  id dist
1  A  1.1
2  A  1.0
5  A  2.1
7  A  3.3
8  A  3.4

If there are no zero elements:

df[with(df, dist/c(dist[-1], tail(dist,1)))<2,]
Colonel Beauvel