Count patterns and differentiate them

Question

I'd like to count a defined pattern (here: 'Y') in a string for each row of a dataframe. Ideally, I'd like to get the number of occurrences in V3 and the length of each occurrence in V4.

Input:

V1  V2
A   XXYYYYY
B   XXYYXX
C   XYXXYX
D   XYYXYX

Output:

V1       V2 V3   V4
 A  XXYYYYY  1    5
 B   XXYYXX  1    2
 C   XYXXYX  2  1,1
 D   XYYXYX  2  2,1

I tried different modifications of the function below, with no success.

dict <- setNames(nm=c("Y"))
seqs <- df$V2
sapply(dict, str_count, string=seqs)

Thanks in advance!

The `str_` functions should all be vectorised I believe. No need to `sapply` them. Also, `gregexpr("Y", df$V2)` should essentially give this in base R. — thelatemail, Jan 17 '16 at 23:26
Thanks, but your solution gives the position of 'Y', not the number of occurrences and/or length — user2904120, Jan 17 '16 at 23:35
@thelatemail If you change the pattern to `"Y+"` the match length will be captured correctly. — steveb, Jan 17 '16 at 23:35
@user2904120 The last row of your output 'D' is missing a 'Y' (see the input). — steveb, Jan 17 '16 at 23:38

stas g · Answer 1 · 2016-01-18T00:48:15.720

Another base R solution, this time using gregexpr:

df <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("XXYYYYY", "XXYYXX" , "XYXXYX", "XYYXYX")
)

Extract the match.length attribute of the gregexpr output, then count the length of each attribute (which tells you how many matches there are):

r <- gregexpr("Y+", df$V2)
len <- lapply(r, FUN = function(x) as.array(attr(x, "match.length")))
df$V3 <- lengths(len)
df$V4 <- len

df
#  V1      V2 V3   V4
#1  A XXYYYYY  1    5
#2  B  XXYYXX  1    2
#3  C  XYXXYX  2 1, 1
#4  D  XYYXYX  2 2, 1

If you have an old version of R that doesn't have lengths() yet (it was added in R 3.2.0), you can use df$V3 <- sapply(len, length) instead. And if you need a more generic function to do the same for any vector x and pattern a:

foo <- function(x, a){
  ans <- data.frame(x)
  r <- gregexpr(a, x)
  len <- lapply(r, FUN = function(z) as.array(attr(z, "match.length")))
  ans$quantity <- lengths(len)
  ans$lengths <- len
  ans
}

Try foo(df$V2, 'Y+').
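One caveat with reading match.length directly: when a string contains no match at all, gregexpr returns -1 for that element, so the count would come out as 1 rather than 0. A minimal sketch of a variant that avoids this by going through regmatches is below. The name count_matches and its column names are hypothetical and chosen only for illustration; this is not part of the answer above.

```r
# Hypothetical helper (not from the answer above): count the matches of
# `pattern` in each element of `x` and record the length of every match.
count_matches <- function(x, pattern) {
  x <- as.character(x)                      # guard against factor input
  m <- regmatches(x, gregexpr(pattern, x))  # list of matched substrings
  data.frame(
    x        = x,
    quantity = lengths(m),                  # 0 when nothing matched
    lengths  = vapply(m, function(z) paste(nchar(z), collapse = ","),
                      character(1)),
    stringsAsFactors = FALSE
  )
}

res <- count_matches(c("XXYYYYY", "XYYXYX", "XXXX"), "Y+")
res
```

Here the third string has no 'Y', so its quantity is 0 and its lengths entry is an empty string. The lengths column is stored as comma-separated text rather than a list column, which is a design choice for easier printing and export, not a requirement.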

score 1 · Accepted Answer · answered Jan 17 '16 at 23:54

Here is a stringr solution:

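library(stringr)

df <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("XXYYYYY", "XXYYXX", "XYXXYX", "XYYXYX")
)

# number of runs of consecutive "Y"s in each string
df$V3 <- str_count(df$V2, "Y+")

# length of each run: end position - start position + 1
df$V4 <- lapply(str_locate_all(df$V2, "Y+"), function(x) {
  paste(x[, 2] - x[, 1] + 1, collapse = ",")
})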

johnson-shuffle

In df$V3 <- str_count(df$V2, "Y+"), how do I specify an arbitrary character, e.g. Y*Y, so that the search finds XYYXYX, which contains YXY? – user2904120 Feb 01 '17 at 11:44
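On the question in the comment above: in a regular expression, * is a quantifier rather than a wildcard, so "Y*Y" means "zero or more Y followed by a Y". The usual single-character wildcard is the dot. A small sketch in base R follows, so it runs without extra packages; the variable names are made up for illustration. The same pattern strings should also be usable with str_count.

```r
x <- c("XXYYYYY", "XXYYXX", "XYXXYX", "XYYXYX")

# "." matches any single character, so "Y.Y" means: Y, any one character, Y.
# Note that the middle character may itself be a "Y".
any_middle <- lengths(regmatches(x, gregexpr("Y.Y", x)))

# "[^Y]" matches any single character except "Y", so "Y[^Y]Y" only finds
# two Ys separated by exactly one different character (e.g. "YXY").
non_y_middle <- lengths(regmatches(x, gregexpr("Y[^Y]Y", x)))

data.frame(x, any_middle, non_y_middle)
```

With "Y.Y" the first string also counts as a hit, because "YYY" fits the pattern; "Y[^Y]Y" restricts the hit to the last string only.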

score 1 · Answer 3 · answered Jan 18 '16 at 00:10

In base R:

aaa <- data.frame(V1 = LETTERS[1:4], 
                  V2 = c("XXYYYYY", "XXYYXX", "XYXXYX", "XYYXYX"),
                  stringsAsFactors = FALSE)

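# split each string into its runs of "Y"s; drop empty pieces rather than
# using [-1], which would discard a real run when a string starts with "Y"
splt <- lapply(aaa$V2, function(x) {
  pieces <- strsplit(x, "[^Y]+")[[1]]
  pieces[nzchar(pieces)]
})

# number of occurrences
aaa$V3 <- lengths(splt)

# length of each occurrence
aaa$V4 <- sapply(splt, function(x) paste(nchar(x), collapse = ","))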
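A quick check of how strsplit treats the start of a string may help here, since it decides which pieces count as runs of "Y". The sketch below uses only base R; the variable names are illustrative. strsplit yields a leading empty string only when the input begins with a separator, so dropping the first piece with [-1] is safe only for strings that start with a non-"Y" character.

```r
# Split on runs of non-"Y" characters and inspect the pieces
lead_x <- strsplit("XYYXYX", "[^Y]+")[[1]]  # string starts with a non-"Y"
lead_y <- strsplit("YXY", "[^Y]+")[[1]]     # string starts with "Y"

# Dropping empty pieces works in both cases, whereas [-1] assumes that the
# first piece is always an empty string
runs_x <- lead_x[nzchar(lead_x)]
runs_y <- lead_y[nzchar(lead_y)]

length(lead_x)  # includes the leading empty piece
length(lead_y)  # no empty piece to drop
```

All four example strings in the question start with "X", which is why the two approaches agree on that data.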

GL_Li

Any idea how to run it as one function, so one can simply specify a pattern, e.g. "Y", "YY", or "X"? – user2904120 Jan 18 '16 at 00:17
@user2904120 see my answer below for a function – stas g Jan 18 '16 at 00:46
