4

I am trying to solve a problem with R using rle() (or another relevant function) but am not sure where to start. The problem is as follows - foo, bar, and baz and qux can be in one of three positions - A, B, or C.

Their first position will always be A, and their last position will always be C, but their positions in between are random.

My objective is to eliminate the first A or first sequence of A's, and the last C or the last sequence of C's. For example:

> foo
   position
1         A
2         A
3         A
4         B
5         B
6         A
7         B
8         A
9         C
10        C

> output(foo)
   position

4         B
5         B
6         A
7         B
8         A


> bar
   position
1         A
2         B
3         A
4         B
5         A
6         C
7         C
8         C
9         C
10        C

> output(bar)
   position

2         B
3         A
4         B
5         A

> baz
   position
1         A
2         A
3         A
4         A
5         A
6         C
7         C
8         C
9         C
10        C

> output(baz)
NULL

> qux
  position
1        A
2        C
3        A
4        C
5        A
6        C

> output(qux)
  position
2        C
3        A
4        C
5        A

Basic rle() will tell me about the sequences and their lengths but it will not preserve row indices. How should one go about solving this problem?

> rle(foo$position)
Run Length Encoding
  lengths: int [1:6] 3 2 1 1 1 2
  values : chr [1:6] "A" "B" "A" "B" "A" "C"
iskandarblue
  • 7,208
  • 15
  • 60
  • 130

5 Answers5

3

I would write a function using cumsum where we check how many of first consecutive values start with first_position and how many of last consecutive values start with last_position and remove them.

get_reduced_data <- function(dat, first_position, last_position) {
    dat[cumsum(dat != first_position) != 0 &
   rev(cumsum(rev(dat) != last_position) != 0)]
 }

get_reduced_data(foo, first_position, last_position)
#[1] "B" "B" "A" "B" "A"

get_reduced_data(bar, first_position, last_position)
#[1] "B" "A" "B" "A"

get_reduced_data(baz, first_position, last_position)
#character(0)

get_reduced_data(qux, first_position, last_position)
#[1] "C" "A" "C" "A"

data

foo <- c("A", "A","A", "B", "B", "A", "B", "A", "C")
bar <- c("A", "B","A", "B", "A", "C", "C", "C", "C", "C")
baz <- c(rep("A", 5), rep("C", 5))
qux <- c("A", "C", "A", "C", "A", "C")
first_position <- "A"
last_position <- "C"
mt1022
  • 16,834
  • 5
  • 48
  • 71
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Here is one option with rle. The idea would be to subset the 1st and last values, check whether it is equal to 'A', 'C', assign it to NA and convert that to a logical vector for subsetting

i1 <- !is.na(inverse.rle(within.list(rle(foo$position), 
     values[c(1, length(values))][values[c(1, length(values))] == c("A", "C")] <- NA)))
foo[i1, , drop = FALSE]
#    position
#4        B
#5        B
#6        A
#7        B
#8        A
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Another possible solution without rle by creating an index and subsetting rows to between first occurrence of non-A and last occurrence of non-C:

library(data.table)
output <- function(DT) {
    DT[, rn:=.I][,{
            mn <- min(which(position!="A"))
            mx <- max(which(position!="C"))
            if (mn > mx) return(NULL)
            .SD[mn:mx]
        }]
}

output(setDT(foo))
#   position rn
#1:        B  4
#2:        B  5
#3:        A  6
#4:        B  7
#5:        A  8

output(setDT(baz))
#NULL

data:

foo <- fread("position
A
A
A
B
B
A
B
A
C
C")

baz <- fread("position
A
A
A
A
A
C
C
C
C
C")
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0

A approach could be,

library(data.table)

setDT(df)[, grp := rleid(position)][
  !(grp == 1 & position == 'A' | grp == max(grp) & position == 'C'), ][
    , grp := NULL][]

which gives,

   position
1:        B
2:        B
3:        A
4:        B
5:        A
Sotos
  • 51,121
  • 6
  • 32
  • 66
0

The problem seems to be two-fold. Triming 'first' and 'last' elements, and identifying what constitutes 'first' and 'last'. I like your rle() approach, because it maps many possibilities into a common structure. So the task is to write a function to mask the first and last elements of a vector of any length

mask_end = function(x) {
    n = length(x)
    mask = !logical(n)
    mask[c(min(1, n), max(0, n))] = FALSE  # allow for 0-length x
    mask
}

This is very easy to test comprehensively

> mask_end(integer(0))
logical(0)
> mask_end(integer(1))
[1] FALSE
> mask_end(integer(2))
[1] FALSE FALSE
> mask_end(integer(3))
[1] FALSE  TRUE FALSE
> mask_end(integer(4))
[1] FALSE  TRUE  TRUE FALSE

The solution (returning the mask; easy to modify to return the actual values, x[inverse.rle(r)]) is then

mask_end_runs = function(x) {
    r = rle(x)
    r$values = mask_end(r$values)
    inverse.rle(r)
}
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112