How to search a vector for a certain pattern

Question

I would like a function that searches a vector for a specific pattern: a 1 immediately followed by a 4 (that is, "1" "4"). It should list all the sequences it finds and print, for each one, its ratios, its length, and where it starts and ends.

It should search parts of the vector of length N >= 8 for the pair (1,4), with these conditions in mind:

1) a specific ratio:

BigRatio = Number of (1,4) * N / (Number of (1) * Number of (4))
    has to be greater than or equal to 0.2 %

2) and the ratio of 1s and 4s in the sequence:

SmallRatio = (Number of 1 + Number of 4) / (length of sequence), for 0.3 %

If the conditions are met, it should then print the sequence and the ratios for every match.

This is the vector:

vector=c(1,1,1,1,1,1,1,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,4,4,2,3,1,4,1,4,1,4,1,4,1,4,
1,4,1,4,4,2,3,1,1,1,1,4,1,1,1,4,4,4,4,2,3,1,1,4,1,4,1,4,1,1,1,4,4,4,4,2,3,3,1,1,
4,1,4,1,4,1,1,1,4,4,4,4,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,1,1,4,2,1,1,1,1,1,1,4,3,
2,4,2,1,5,6,2,3,1,2,4,1,2,3,1,1,1,1,1,1,1,2,3,4,5,1,2,3,4,1,1,1,1,1,1,2,3,4,1,1,
1,2,3,1,2,3,1,2,3,4,3,1,2,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,
4,1,4,4,2,3,1,1,1,1,4,1,1,1,3,1,1,1,1,4,1,1,1,3,1,1,1,1,4,1,1,1,4,1,1,1,3,1,1,
1,1,4,2,3,1,1,4,1,4,1,4)

vector2=as.character(vector)

I converted it to character because I thought it would be easier that way. I may be wrong.

My code/Progress so far

I had two ideas about this:

1) The function could look at 8 or more numbers at once (the length would be an argument of the function) and then check the ratios. It would then report the details of that piece if it is a good piece of 8 numbers.

2) The other idea is a scoring system that gives 5 points for a pair 1,4 and -1 for every other number. It should then somehow estimate where these parts are and find the segments. The problem with the first idea is that one segment might have 40 % and the next segment 20 %, while a stretch spanning both might have more than either. So I was trying to figure out how to escape this trap of false negatives. Maybe the search should check every number or pair of numbers rather than a whole segment. This is more complicated, but more precise.

With the code, I am stuck on how to write the function. I know the arguments should be the vector and the desired length of the sequence to search for (if I go for the first idea). I think I have to use a for loop over every number (or every two numbers) so that I can check whether they are equal to (1,4), then "remember" the match and calculate the length of that part. And of course I need to count every 1 or 4 in each part to calculate its ratios.

I thought of using this kind of loop:

for (i in 1:length(vector)) {
    idx <- agrep(vector[i],x)
    matches[i] <- length(vector)

But I think it is wrong.

I am still new to programming and R.

Additional question:

How would the function look if it were used on a data frame? Would the search change to specific rows? Is it possible to convert a vector into a data frame?

EDIT:

Another example and clarification:

sample2=c("aaaaabababababababababababababababcabcbababc bcbabcbcdddcbcbcdcbcbcbdcb
          bcbcbcbdbdbcbcbcbccbbcbbcbcbcbcbcbcbabababababababccbbcbbcbcbcbcbcbcbdbdbcbcbcbccb
          bcbcbcbdbdbcbcbcbccbbcbbcbcbcbcbcbbababababababababababababacbcbacbcbcdcbcbcbdcbbcdaddabcbac
          cabcbabcbabcbcbbabbabababababababababababa")

nchar(sample2)

So this is what it should do:

1) idea

Search the string in parts of 50 characters, which means this part first:
```
 "aaaaabababababababababababababababcabcbababcbcbabc"
```

and then this part (the next sequence of 50 elements of that string):

  "bcbabcbcdddcbcbcdcbcbcbdcbbcbcbcbdbdbcbcbcbccbbcbb"

And do this for every subsequent 50 elements of the string.

As you can see, the second 50 elements contain hardly any "ba", so that part will not be shown because it does not meet the condition.

The next step would be to check whether the condition is met (for example, a ratio > 0.5) using the formula mentioned above for a certain pattern, which is "ba" in this case. If the ratio of "ba" is more than 0.5, it should print out that sequence, say where it starts, return the ratios, and so on. That should be in a data frame, for example.

The next idea was to calculate what the optimal segment for > 0.5 in this string would be. There would be a problem if the first part of 50 elements had 0.4 of "ba" in it and the next 50 had 0.1 of "ba", right at the beginning of that part. Imagine the first 50 have a lot of "ba" at the end, but not enough:

   "aaaaabababababdcdcdcdacacbababababababababababababab"

The next 50 have a lot at the beginning:

   "bababababababcbcdcbcbcbdcbbcbcbcbdbdbcbcbcbccbbcbbcd"

So how can this be made more optimal? Should there be a scoring system for "ba", as explained above, to find the optimal lengths of a segment that satisfies the conditions?

Your vector starts with `1 1 1 1 1 1 1 4 4 4 4`. Is it counted as one group or the `1 4` at 7th and 8th positions is group No 1? — akrun, Nov 08 '14 at 16:37
It seems like there are 44 elements in vector that satisfies that condition — akrun, Nov 08 '14 at 16:43
Oh, I get it. I gave a bad example of what I want. The vector should be much bigger, with many more numbers, but I thought it would be better to use a small vector as an example. Yes, there are 44 elements, but I am interested in segments: I want to split the vector into segments that each satisfy the conditions. — , Nov 08 '14 at 16:46
It is not specific enough yet. You give a vector but you do not say what the correct answer would be. At the moment the criterion for an acceptable run-length of > 0.2% would seem to accept any run. Is that what was planned? — IRTFM, Nov 08 '14 at 16:47
@BondedDust I understand. I did not plan for this. The criterion can go up; this is just an arbitrary and reproducible example I made. The criterion should be high enough that I can efficiently split the vector into segments that satisfy it. I could not give it a run because I have not written the function. I have an idea of what the result should look like: a segment (or several segments) from the vector in which pairs of `1 4` make up at least, for example, 0.5, so that I can say how many segments satisfy that condition. — , Nov 08 '14 at 16:54
It also remains unclear how the two different cases of 1111144444 and 1414141414 get handled. Differently or equivalently? — IRTFM, Nov 08 '14 at 16:56
@BondedDust `1111144444` is not the wanted case. It is `1414141414` what is wanted, because it is one after the other one. That is a pair of numbers, `1 4` or (1,4). — , Nov 08 '14 at 16:58

IRTFM · Accepted Answer · 2014-11-08T18:21:24.973

I'm rather annoyed that after producing useful code there is still no upvote and the problem still seems ambiguous. The new example has linefeeds in it, but it's not clear whether we are supposed to read these in as separate lines, since:

> nchar(readLines(textConnection(sample2)))
[1]  71  92 102  52

It's not that hard to split a long character value into smaller parts:

samp3 <- paste(rep("a", 300), collapse="")
mapply( substr, seq(1,nchar(samp3),by=50), seq(1,nchar(samp3),by=50)+49, MoreArgs=list(x=samp3))
[1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[2] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[3] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[4] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[5] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[6] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

If you want to progress in your academic pursuits you need to work on expressing a concrete example in a manner others can execute.
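
Once the string is in fixed-width pieces, counting a pattern per piece takes only a few more lines. This is a minimal sketch, assuming non-overlapping matches and taking "share" to mean the fraction of a chunk's characters covered by the pattern. That definition and the helper name chunk_pattern_share are my assumptions; the question does not pin either down.

```r
# Hypothetical helper (not part of the original answer): split a string into
# fixed-width chunks and report how much of each chunk the pattern covers.
chunk_pattern_share <- function(x, pattern, width = 50) {
  starts <- seq(1, nchar(x), by = width)
  ends   <- pmin(starts + width - 1, nchar(x))   # last chunk may be shorter
  chunks <- substring(x, starts, ends)           # vectorized substr()
  # gregexpr() returns -1 when nothing matches, so count only positive positions
  hits <- vapply(gregexpr(pattern, chunks, fixed = TRUE),
                 function(m) sum(m > 0), integer(1))
  data.frame(start = starts, end = ends, hits = hits,
             share = hits * nchar(pattern) / nchar(chunks))
}

# Toy input: 50 characters of "ba" followed by 50 characters of "cb"
demo <- paste0(strrep("ba", 25), strrep("cb", 25))
chunk_pattern_share(demo, "ba", width = 50)
# first chunk: 25 hits, share 1; second chunk: 0 hits, share 0
```

For sample2 you would probably want to strip the embedded whitespace first, for example with gsub("\\s", "", sample2), and then keep only the rows whose share exceeds your threshold.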

------------First attempt:

Here is some vectorized code that should produce the tools needed to do this. Finding the correct vectorized functions lets you move beyond the for-loop-mentality fostered by SAS and BASIC. Loops can be useful when needed but generally R programmers try to avoid them unless really needed. I'm not sure what the exact desired outcome is, but at least this should move the conversation forward:

# collapse the vector into a single character string
# (positions line up with "vector" only because every element is a single digit)
collapsV <- paste0(vector,collapse="") 
pos14 <- gregexpr("14", collapsV)  # regex pattern matching: start position of every "14"
# run-length encode the gaps between hits; a gap of 2 means two "14"s next to each other
diff14_2 <- rle( diff(pos14[[1]]) ) 
# the rle value is a two-element list that looks like
#  lengths: int [1:22] 1 1 6 1 1 1 2 1 1 2 ...
#  values : int [1:22] 13 7 2 8 4 8 2 4 9 2 ...

which( diff14_2$values==2 & diff14_2$lengths>4)
[1]  3 16

So the third gregexpr "hit" will be the position in "vector" of the first 14141414 run that passes the filter. Note that lengths>4 means more than four consecutive gaps of 2, i.e. at least six adjacent "14" pairs. Check it:

> pos14[[1]][3]
[1] 27
> vector[27:40]
 [1] 1 4 1 4 1 4 1 4 1 4 1 4 1 4
> vector[25:40]
 [1] 2 3 1 4 1 4 1 4 1 4 1 4 1 4 1 4

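Note that 16 is an index into the run-length encoding, not into pos14; for the third run the two coincide only because both earlier runs have length 1. The first hit of run 16 is pos14[[1]][sum(diff14_2$lengths[1:15]) + 1], i.e. hit 23. Indexing with 16 directly lands on a shorter run of only three pairs: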

> pos14[[1]][16]
[1] 76
> vector[76:(76+8)]
[1] 1 4 1 4 1 4 1 1 1
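
The lookups above can be wrapped into one function that returns the start, end and length of every run directly. This is only a sketch: the name find_pair_runs and the min_pairs argument are hypothetical, and it groups hits with cumsum() over the gaps rather than indexing into the rle() output, which avoids converting between run index and hit index. Like the code above, it assumes every element of the vector is a single digit.

```r
# Hypothetical helper: locate runs of back-to-back (1,4) pairs.
# Assumes all elements are single digits, so string positions equal vector positions.
find_pair_runs <- function(v, min_pairs = 4) {
  s    <- paste0(v, collapse = "")
  hits <- as.integer(gregexpr("14", s, fixed = TRUE)[[1]])
  if (hits[1] == -1) {                       # gregexpr() signals "no match" with -1
    return(data.frame(start = integer(0), end = integer(0), pairs = integer(0)))
  }
  # a new run begins wherever the gap to the previous hit is not exactly 2
  run_id <- cumsum(c(TRUE, diff(hits) != 2))
  start  <- as.integer(tapply(hits, run_id, min))
  pairs  <- as.integer(tapply(hits, run_id, length))
  out <- data.frame(start = start, end = start + 2L * pairs - 1L, pairs = pairs)
  out[out$pairs >= min_pairs, , drop = FALSE]
}

# Toy input: a run of 3 pairs at positions 1-6 and a run of 5 pairs at 10-19
demo_v <- c(1,4,1,4,1,4,2,3,1,1,4,1,4,1,4,1,4,1,4,2)
find_pair_runs(demo_v, min_pairs = 4)
# keeps only the second run: start 10, end 19, pairs 5
```

Calling find_pair_runs(vector, min_pairs = 6) on the vector from the question should correspond to the which() filter used above.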

You should print out all the intermediate values to see what is happening.
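
As for the two ratios in the question, here is a rough sketch of how they could be computed per window, reading the formulas literally. The function names window_ratios and scan_windows are hypothetical, and the use of non-overlapping windows with a trailing partial window dropped is my assumption. The thresholds are left to the caller, since the ones given would accept almost anything.

```r
# Hypothetical helper: the two ratios from the question for one window w
window_ratios <- function(w) {
  N   <- length(w)
  n1  <- sum(w == 1)
  n4  <- sum(w == 4)
  # a (1,4) pair is a 1 immediately followed by a 4
  n14 <- sum(w[-N] == 1 & w[-1] == 4)
  c(BigRatio   = if (n1 > 0 && n4 > 0) n14 * N / (n1 * n4) else NA_real_,
    SmallRatio = (n1 + n4) / N)
}

# Assumption: non-overlapping windows of length N; a trailing partial window is dropped
scan_windows <- function(v, N = 8) {
  stopifnot(length(v) >= N)
  starts <- seq(1, length(v) - N + 1, by = N)
  ratios <- t(vapply(starts, function(s) window_ratios(v[s:(s + N - 1)]),
                     numeric(2)))
  data.frame(start = starts, end = starts + N - 1, ratios)
}

# Toy input: an alternating window followed by a blocked one
demo_w <- c(1,4,1,4,1,4,1,4,  1,1,1,1,4,4,4,4)
scan_windows(demo_w, N = 8)
# BigRatio is 2 for the alternating window and 0.5 for the blocked one;
# SmallRatio is 1 for both
```

The toy input shows why BigRatio matters: both windows consist entirely of 1s and 4s, but only the alternating one scores high, which matches the 1414141414 versus 1111144444 distinction from the comments.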

I will make an edit to explain more about what I want to do with another example. — , Nov 08 '14 at 17:37
