Using rle to eliminate first and last sequences

Question

I am trying to solve a problem with R using rle() (or another relevant function) but am not sure where to start. The problem is as follows - foo, bar, and baz and qux can be in one of three positions - A, B, or C.

Their first position will always be A, and their last position will always be C, but their positions in between are random.

My objective is to eliminate the first A or first sequence of A's, and the last C or the last sequence of C's. For example:

> foo
   position
1         A
2         A
3         A
4         B
5         B
6         A
7         B
8         A
9         C
10        C

> output(foo)
   position

4         B
5         B
6         A
7         B
8         A


> bar
   position
1         A
2         B
3         A
4         B
5         A
6         C
7         C
8         C
9         C
10        C

> output(bar)
   position

2         B
3         A
4         B
5         A

> baz
   position
1         A
2         A
3         A
4         A
5         A
6         C
7         C
8         C
9         C
10        C

> output(baz)
NULL

> qux
  position
1        A
2        C
3        A
4        C
5        A
6        C

> output(qux)
  position
2        C
3        A
4        C
5        A

Basic rle() will tell me about the sequences and their lengths but it will not preserve row indices. How should one go about solving this problem?

> rle(foo$position)
Run Length Encoding
  lengths: int [1:6] 3 2 1 1 1 2
  values : chr [1:6] "A" "B" "A" "B" "A" "C"

score 3 · Answer 1 · edited Apr 18 '18 at 07:06

I would write a function using cumsum where we check how many of first consecutive values start with first_position and how many of last consecutive values start with last_position and remove them.

get_reduced_data <- function(dat, first_position, last_position) {
    dat[cumsum(dat != first_position) != 0 &
   rev(cumsum(rev(dat) != last_position) != 0)]
 }

get_reduced_data(foo, first_position, last_position)
#[1] "B" "B" "A" "B" "A"

get_reduced_data(bar, first_position, last_position)
#[1] "B" "A" "B" "A"

get_reduced_data(baz, first_position, last_position)
#character(0)

get_reduced_data(qux, first_position, last_position)
#[1] "C" "A" "C" "A"

data

foo <- c("A", "A","A", "B", "B", "A", "B", "A", "C")
bar <- c("A", "B","A", "B", "A", "C", "C", "C", "C", "C")
baz <- c(rep("A", 5), rep("C", 5))
qux <- c("A", "C", "A", "C", "A", "C")
first_position <- "A"
last_position <- "C"

score 1 · Accepted Answer · answered Apr 18 '18 at 06:58

Here is one option with rle. The idea would be to subset the 1st and last values, check whether it is equal to 'A', 'C', assign it to NA and convert that to a logical vector for subsetting

i1 <- !is.na(inverse.rle(within.list(rle(foo$position), 
     values[c(1, length(values))][values[c(1, length(values))] == c("A", "C")] <- NA)))
foo[i1, , drop = FALSE]
#    position
#4        B
#5        B
#6        A
#7        B
#8        A

chinsoon12 · Answer 3 · 2018-04-18T08:07:01.227

0

Another possible solution without rle by creating an index and subsetting rows to between first occurrence of non-A and last occurrence of non-C:

library(data.table)
output <- function(DT) {
    DT[, rn:=.I][,{
            mn <- min(which(position!="A"))
            mx <- max(which(position!="C"))
            if (mn > mx) return(NULL)
            .SD[mn:mx]
        }]
}

output(setDT(foo))
#   position rn
#1:        B  4
#2:        B  5
#3:        A  6
#4:        B  7
#5:        A  8

output(setDT(baz))
#NULL

data:

foo <- fread("position
A
A
A
B
B
A
B
A
C
C")

baz <- fread("position
A
A
A
A
A
C
C
C
C
C")

edited Apr 18 '18 at 08:07

answered Apr 18 '18 at 07:05

chinsoon12

25,005
4
25
35

1

Take care of coner case. This doesn't work for `baz`. – mt1022 Apr 18 '18 at 07:50
1

thanks, @mt1022. I didnt notice the scroll bar on the right and was scratching my head abt `baz` and `qux` – chinsoon12 Apr 18 '18 at 08:07

Sotos · Answer 4 · 2018-04-18T07:12:34.850

0

A data.table approach could be,

library(data.table)

setDT(df)[, grp := rleid(position)][
  !(grp == 1 & position == 'A' | grp == max(grp) & position == 'C'), ][
    , grp := NULL][]

which gives,

   position
1:        B
2:        B
3:        A
4:        B
5:        A

edited Apr 18 '18 at 07:12

answered Apr 18 '18 at 07:06

Sotos

51,121
6
32
66

score 0 · Answer 5 · answered Apr 18 '18 at 13:51

The problem seems to be two-fold. Triming 'first' and 'last' elements, and identifying what constitutes 'first' and 'last'. I like your rle() approach, because it maps many possibilities into a common structure. So the task is to write a function to mask the first and last elements of a vector of any length

mask_end = function(x) {
    n = length(x)
    mask = !logical(n)
    mask[c(min(1, n), max(0, n))] = FALSE  # allow for 0-length x
    mask
}

This is very easy to test comprehensively

> mask_end(integer(0))
logical(0)
> mask_end(integer(1))
[1] FALSE
> mask_end(integer(2))
[1] FALSE FALSE
> mask_end(integer(3))
[1] FALSE  TRUE FALSE
> mask_end(integer(4))
[1] FALSE  TRUE  TRUE FALSE

The solution (returning the mask; easy to modify to return the actual values, x[inverse.rle(r)]) is then

mask_end_runs = function(x) {
    r = rle(x)
    r$values = mask_end(r$values)
    inverse.rle(r)
}

Using rle to eliminate first and last sequences

5 Answers5