I want to create a random string of length 50 from just four characters, 'e', 'l', 'i' and 'r', which occur with frequencies 0.6, 0.1, 0.1 and 0.2. After creating the random string, I want to repeat the simulation 5000 times and see the proportion in which 'el' occurs in each simulation. I've created one random string using the following command:

x <- paste(sample( c('e','l','i','r'), 50, replace=TRUE, prob=c(0.6,0.1, 0.1, 0.2) ),collapse = '')

But now I'm confused about how to run the simulation 5000 times. From what I found, I could perhaps use the replicate function and then compute the proportion from its result, but I don't see how to use it in this context. Help would be appreciated.

Shafa Haider
And how can I find the proportion of 'el' in each simulation? Do I divide the number of times 'el' occurs in each simulation by the total length of the string? – Shafa Haider Sep 05 '18 at 22:15

2 Answers

So you are counting co-occurrences. This is much easier (and faster) if you do not paste the sample into a single string. I would adapt the function foo defined in my answer to the linked Q & A.

set.seed(0)

## one example sample
char_tank <- c('e','l','i','r')
char_prob <- c(0.6, 0.1, 0.1, 0.2)
x <- sample(char_tank, 50, replace = TRUE, prob = char_prob)
# [1] "i" "e" "e" "e" "l" "e" "i" "l" "r" "r" "e" "e" "e" "r" "e" "r" "e" "r" "l"
#[20] "e" "r" "l" "e" "r" "e" "e" "e" "e" "e" "i" "e" "e" "e" "e" "e" "i" "r" "r"
#[39] "e" "r" "e" "i" "r" "r" "e" "e" "r" "e" "e" "r"

## adapted from function `foo` from https://stackoverflow.com/a/51695793/4891738
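## the function is shorter because you only want to find matches, not remove them
count_co_occurrence <- function (xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  ## column j holds the indices j, j + 1, ..., j + ns - 1 of one sliding window
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  ## xs is recycled down each column, so every window is compared with the pattern
  d <- xm[shift_ind] == xs
  ## a window is a match when all ns of its comparisons are TRUE
  sum(.colSums(d, ns, length(d) / ns) == ns)
}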

count_co_occurrence(x, c('e', 'l'))
#[1] 1

In this case you see that x[4:5] is a match.
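
For the two-character case, the count can be cross-checked with a much simpler comparison of each element against its right-hand neighbour. This is only a sketch: `count_pair` and `x_demo` are names made up here for illustration and are not part of the function above.

```r
## a small hand-made vector, so the expected count is easy to verify by eye
x_demo <- c("e", "l", "i", "e", "e", "l", "r", "l", "e")

## count positions where `first` is immediately followed by `second`
count_pair <- function(x, first, second) {
  n <- length(x)
  if (n < 2L) return(0L)          # no adjacent pairs to inspect
  sum(x[-n] == first & x[-1L] == second)
}

count_pair(x_demo, "e", "l")
#[1] 2
```

The matches are at positions 1:2 and 5:6. This shortcut only covers patterns of length two; the sliding-window function handles longer patterns.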


It is straightforward to use replicate + lapply / sapply / vapply to repeat the above.

## just do 10 simulations as a small example
set.seed(0)
xx <- replicate(10, sample(char_tank, 50, replace = TRUE, prob = char_prob),
                simplify = FALSE)
yy <- vapply(xx, count_co_occurrence, xs = c('e', 'l'), 0L)
# [1] 1 1 4 1 2 2 4 4 5 5

I am not sure how you would define "proportion" of co-occurrence. Is it yy / (50 - (2 - 1)) in this case?
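
If the proportion is taken to be the count divided by the number of adjacent pairs, which is an assumption about what the question means, the full 5000-run simulation could be sketched as below. The helper `count_el` and the variables `n`, `n_sim`, `counts` and `prop` are illustrative names. Each adjacent pair is 'el' with probability 0.6 * 0.1 = 0.06, so the average proportion should land close to that value.

```r
set.seed(0)

char_tank <- c("e", "l", "i", "r")
char_prob <- c(0.6, 0.1, 0.1, 0.2)
n     <- 50     # length of each simulated sequence
n_sim <- 5000   # number of simulations

## count positions where "e" is immediately followed by "l"
count_el <- function(x) sum(head(x, -1) == "e" & tail(x, -1) == "l")

## one count per simulation
counts <- replicate(n_sim,
                    count_el(sample(char_tank, n, replace = TRUE, prob = char_prob)))

## assumed definition: matches divided by the n - 1 adjacent pairs
prop <- counts / (n - 1)

mean(prop)      # expected to be near 0.06
```

`prop` is a vector of 5000 proportions, one per simulation, so `hist(prop)` or `summary(prop)` shows how it varies from run to run.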

Zheyuan Li

Starting with your code:

set.seed(2)
x <- paste(sample( c('e','l','i','r'), 50, replace=TRUE, prob=c(0.6,0.1, 0.1, 0.2) ),collapse = '')
x
# [1] "ereelleieeeereeileeereieeeeeleeeiieriereleeelrleei"

We can easily replicate this with:

set.seed(2)
xmany <- replicate(5000, paste(sample( c('e','l','i','r'), 50, replace=TRUE, prob=c(0.6,0.1, 0.1, 0.2) ),collapse = ''))
head(xmany)
# [1] "ereelleieeeereeileeereieeeeeleeeiieriereleeelrleei"
#4#       ^                       ^           ^   ^
# [2] "eerleirlrrrireieeeeeeeeeereieeereeeilereleeeeeeeee"
#1#                                           ^
# [3] "eelieeieeeereeiiieleeeliereereelelereieeeereerreee"
#5#     ^               ^   ^        ^ ^
# [4] "eelereieeeilerereleeleeiereerelelreiereeeeleeeeeee"
#6#     ^              ^  ^         ^ ^         ^
# [5] "irrleieeeeleirleeeeeeleerilerireieieeeeeieerlleeee"
#2#             ^          ^
# [6] "reeereeeerrereirerieiliereleeeelrreleereeerereeeee"
#3#                             ^    ^   ^

I've added the caret lines to highlight the occurrences of "el" in each string; the number between the # marks is the count for that string.

If you need the number of occurrences of "el" within each string, then (drop the head to get the counts for all 5000 strings):

ispos <- function(a) a > 0
## apply the filter inside each string's match vector, not across the whole list
head( lengths(lapply(gregexpr("el", xmany), Filter, f = ispos)) )
# [1] 4 1 5 6 2 3

Note: I created the ispos function because gregexpr returns -1 for a string with no match, so every element of its result has length 1 or more. Dropping the non-positive entries from each element gives an honest count. (I could have used regmatches(xmany, gregexpr(...)), but that seems like more work than is necessary just to count occurrences.)
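
To illustrate the -1 convention and the regmatches alternative mentioned above, here is a small sketch on three hand-made strings. The vector `s` is hypothetical toy data, not simulation output.

```r
s <- c("elel", "iiii", "reel")
m <- gregexpr("el", s, fixed = TRUE)

## the second string has no match, so its element is -1 (length 1, not 0)
first_pos <- vapply(m, function(p) p[1], integer(1))
first_pos
# first match positions: 1, -1, 3

## two ways to get an honest count per string
n_by_sum <- vapply(m, function(p) sum(p > 0), integer(1))
n_by_regmatches <- lengths(regmatches(s, m))
n_by_sum
# counts: 2, 0, 1
identical(n_by_sum, n_by_regmatches)
```

Counting with `lengths(m)` directly would report 1 for the second string, which is why the non-positive entries have to be dropped first.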

If you need the frequency table for it:

table( lengths(lapply(gregexpr("el", xmany), Filter, f = ispos)) )
# (one count per string, so the table entries sum to 5000)
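
One reading of the question is the share of simulations that contain "el" at least once; another is the per-string count divided by the 49 adjacent positions. Both can be sketched from per-string counts as below. This block regenerates the strings itself so that it runs standalone; `sims`, `n_el` and `prop_el` are illustrative names, and the choice of denominator is an assumption.

```r
set.seed(2)

chars <- c("e", "l", "i", "r")
probs <- c(0.6, 0.1, 0.1, 0.2)

## 5000 pasted strings of length 50
sims <- replicate(5000,
                  paste(sample(chars, 50, replace = TRUE, prob = probs),
                        collapse = ""))

## number of "el" occurrences in each string (0 when there is no match)
n_el <- lengths(regmatches(sims, gregexpr("el", sims, fixed = TRUE)))

## sanity check: one count per string
sum(table(n_el)) == length(sims)

## share of simulations with at least one "el"
mean(n_el > 0)

## per-string proportion, assuming nchar - 1 possible starting positions
prop_el <- n_el / (nchar(sims) - 1)
summary(prop_el)
```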
r2evans