9

Suppose I have a vector of values, such as:

A C A B A C C B B C C A A A B B B B C A

I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above, the result would be:

NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6

(where NA indicates that this is the first time the element has been seen).

For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7, and so on.
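To make the definition concrete, here is a minimal sketch for a single value. It writes the example vector out literally (rather than relying on `sample()`), and the names `pos_A` and `gaps` are only illustrative:

```r
# Example vector from above, written out explicitly
x <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C",
       "C", "A", "A", "A", "B", "B", "B", "B", "C", "A")

# Positions at which "A" occurs
pos_A <- which(x == "A")
pos_A
# [1]  1  3  5 12 13 14 20

# Distance from each occurrence of "A" back to the previous one
gaps <- diff(pos_A)
gaps
# [1] 2 2 7 1 1 6
```

These gaps are the values that appear at the positions of the second and later A's in the desired output; the first A gets NA.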

Is there a pre-built pipe-compatible function that does this?

I hacked together this function to demonstrate:

# For reproducibility
set.seed(1)

# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)


compute_lag_counts = function(x, first_time = NA){
  # return vector to fill
  lag_counts = rep(-1, length(x))
  # values to match
  vals = unique(x)
  # find all positions of all elements in the target vector
  match_list = grr::matches(vals, x, list = TRUE)
  # compute the lags, then put them in the appropriate place in the return vector
  for(i in seq_along(match_list))
    lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
  
  # return vector
  return(lag_counts)
}

compute_lag_counts(x)

Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.

richarddmorey

3 Answers

8

Or, in base R with ave():

# seq_along(x) gives the positions 1..length(x); ave() applies FUN within each group of x
ave(seq_along(x), x, FUN = function(i) c(NA, diff(i)))
#  [1] NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6

We calculate the first difference of the position indices within each group of x, prepending NA for the first occurrence of each value.
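Since the question asks for something pipe-compatible, the same idea can be wrapped in a small function. This is only a sketch: `lag_counts` is a made-up name, not a function from any package, and the native pipe `|>` assumes R 4.1 or later (magrittr's `%>%` should work the same way).

```r
# Example vector from the question, written out explicitly
x <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C",
       "C", "A", "A", "A", "B", "B", "B", "B", "C", "A")

# Hypothetical helper: distance to the previous occurrence of each element
lag_counts <- function(x) {
  # Group the positions by value, difference them within each group,
  # and let ave() put the results back in the original order
  ave(seq_along(x), x, FUN = function(i) c(NA, diff(i)))
}

res <- x |> lag_counts()
res
#  [1] NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6
```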


A data.table option, thanks to @Henrik:

library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
markus
    I was writing on a `data.table` alternative in a similar vein, but you were faster: `dt = data.table(x)`; `dt[ , d := .I - shift(.I), x]`. – Henrik Jul 06 '20 at 20:31
3

Here's a function that would work:

compute_lag_counts <- function(x) {
  seqs <- split(seq_along(x), x)
  unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}

compute_lag_counts(x)
# [1] NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6

Basically, you use split() to separate the indexes where values appear by each unique value in your vector. Then we take the difference between consecutive indexes within each group to get the distance to the previous occurrence, with NA for the first occurrence. Finally, we use unsplit() to put those values back in the original order.
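The intermediate steps can be inspected one at a time. This is a sketch using the example vector written out literally; `seqs`, `lags` and `result` are just illustrative names:

```r
# Example vector from the question, written out explicitly
x <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C",
       "C", "A", "A", "A", "B", "B", "B", "B", "C", "A")

# Step 1: positions of each unique value
seqs <- split(seq_along(x), x)
seqs$A
# [1]  1  3  5 12 13 14 20

# Step 2: within each group, distance to the previous position (NA for the first)
lags <- lapply(seqs, function(i) c(NA, diff(i)))
lags$A
# [1] NA  2  2  7  1  1  6

# Step 3: put the per-group results back in the original order
result <- unsplit(lags, x)
result
#  [1] NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6
```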

MrFlick
2

An option with dplyr: add a row-number column, group by the original vector, and take the difference between each row number and the previous one within the group (v1 is defined under "data" below):

library(dplyr)
tibble(v1) %>% 
   mutate(ind = row_number()) %>%
   group_by(v1) %>% 
   mutate(new = ind - lag(ind)) %>%
   pull(new)
#[1] NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6

data

v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A", 
"A", "A", "B", "B", "B", "B", "C", "A")
akrun