I would like to have a function that searches a vector
for a specific pattern: a "1" immediately followed by a "4" (that is "1" "4"). It should list all the sequences found and print
the ratios for each one, its length, and where it starts and ends.
It should search parts of the vector of length N >= 8 for the pair (1,4), with these conditions in mind:
1) a specific ratio, defined as
BigRatio = Number of (1,4) * N / (Number of (1) * Number of (4)),
has to be greater than or equal to 0.2 %
2) and the share of 1s and 4s in the sequence, defined as
SmallRatio = (Number of 1 + Number of 4) / (length of sequence), has to reach 0.3 %
If the conditions are met, it should then print the sequence and the ratios for every match.
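To make the two formulas concrete, here is a minimal R sketch that computes them for a single window. The helper name window_ratios and its arguments first and second are made up for illustration, and returning NA when the window has no 1 or no 4 is an assumption, since the BigRatio formula would divide by zero there.

```r
# Hypothetical helper: both ratios for one window `x` (a numeric vector).
window_ratios <- function(x, first = 1, second = 4) {
  n <- length(x)
  # Count positions where `first` is immediately followed by `second`
  n_pairs  <- sum(x[-n] == first & x[-1] == second)
  n_first  <- sum(x == first)
  n_second <- sum(x == second)
  # Assumption: BigRatio is undefined (NA) if either count is zero
  big <- if (n_first > 0 && n_second > 0) {
    n_pairs * n / (n_first * n_second)
  } else {
    NA_real_
  }
  small <- (n_first + n_second) / n
  c(pairs = n_pairs, BigRatio = big, SmallRatio = small)
}

window_ratios(c(1, 1, 1, 1, 4, 4, 2, 3))
```

For this window, N = 8 with one (1,4) pair, four 1s and two 4s, so BigRatio is 1 * 8 / (4 * 2) = 1 and SmallRatio is 6 / 8 = 0.75.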
This is the vector:
vector=c(1,1,1,1,1,1,1,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,4,4,2,3,1,4,1,4,1,4,1,4,1,4,
1,4,1,4,4,2,3,1,1,1,1,4,1,1,1,4,4,4,4,2,3,1,1,4,1,4,1,4,1,1,1,4,4,4,4,2,3,3,1,1,
4,1,4,1,4,1,1,1,4,4,4,4,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,1,1,4,2,1,1,1,1,1,1,4,3,
2,4,2,1,5,6,2,3,1,2,4,1,2,3,1,1,1,1,1,1,1,2,3,4,5,1,2,3,4,1,1,1,1,1,1,2,3,4,1,1,
1,2,3,1,2,3,1,2,3,4,3,1,2,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,
4,1,4,4,2,3,1,1,1,1,4,1,1,1,3,1,1,1,1,4,1,1,1,3,1,1,1,1,4,1,1,1,4,1,1,1,3,1,1,
1,1,4,2,3,1,1,4,1,4,1,4)
vector2=as.character(vector)
I converted it to character because I thought it would be easier that way. I may be wrong.
My code/progress so far
I had two ideas about this:
1) The function could look at 8 or more numbers at once (I can choose the length in the function) and then check the ratios. It would then report on that piece if it is a good piece of 8 numbers.
2) The other idea is a scoring system that gives 5 points for a pair of 1,4 and -1 for every other number. It should then somehow estimate where these parts are and find these segments. The problem with the first idea is that one segment might have 40 % and the next segment 20 %, and a stretch spanning parts of both might have more than that. So I was trying to figure out how to avoid this trap of false negatives. Maybe the search should check every number or pair of numbers rather than a whole segment. This is more complicated, but more precise.
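The scoring idea in 2) could be sketched roughly as follows. This assumes one particular scoring (+5 on the 1 that opens a pair, 0 on the 4 that closes it, -1 on every other number) and uses a maximum-sum scan (Kadane's algorithm) to find the single best-scoring stretch. The name best_segment and its parameters are hypothetical, and it only returns one segment, not all of them.

```r
# Hypothetical sketch: find the one stretch with the highest total score.
best_segment <- function(x, first = 1, second = 4, reward = 5, penalty = -1) {
  n <- length(x)
  opens  <- c(x[-n] == first & x[-1] == second, FALSE)  # a pair starts here
  closes <- c(FALSE, opens[-n])                         # a pair ends here
  # Assumed scoring: reward on the opener, 0 on the closer, penalty elsewhere
  score <- ifelse(opens, reward, ifelse(closes, 0, penalty))

  best <- -Inf; best_start <- 1; best_end <- 1
  run <- 0; run_start <- 1
  for (i in seq_len(n)) {
    # A running total that is not positive cannot help; start fresh here
    if (run <= 0) { run <- 0; run_start <- i }
    run <- run + score[i]
    if (run > best) { best <- run; best_start <- run_start; best_end <- i }
  }
  # If the segment ends on a pair opener, include the closing element too
  if (opens[best_end]) best_end <- best_end + 1
  c(start = best_start, end = best_end, score = best)
}

best_segment(c(2, 3, 1, 4, 1, 4, 1, 4, 2, 3, 2, 3, 2, 3, 1, 4))
# start 3, end 8, score 15
```

Because the segment boundaries come from the scores rather than from a fixed grid, a dense run of pairs is not cut in half by a window edge. The reward and penalty values control how much noise a segment may absorb.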
With the code, I am stuck on how to write the function. I know the arguments should be
the vector
and the desired length of the sequence I would like to search for (if I go with
the first idea). I think I have to use a for loop
to go through every number (or every two numbers) so that I can check whether they are equal to (1,4), then "remember" the match and calculate
the length of that part. And of course count every 1 or 4 in that part to
calculate the ratios for them.
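I thought of using this kind of loop:
matches <- logical(length(vector) - 1)  # one flag per adjacent pair
for (i in seq_len(length(vector) - 1)) {
  # TRUE where a 1 is immediately followed by a 4
  matches[i] <- vector[i] == 1 && vector[i + 1] == 4
}
But I am not sure this is the right approach.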
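For the first idea (fixed pieces of N numbers), a rough sketch might look like the following. The function name scan_windows and all of its arguments are hypothetical. The thresholds are passed as plain fractions (0.2 and 0.3), which is only one reading of the percentages above, so they may need adjusting. Windows do not overlap, and any leftover shorter than n at the end is ignored.

```r
# Hypothetical sketch: check non-overlapping windows of length n.
scan_windows <- function(x, n = 8, min_big = 0.2, min_small = 0.3,
                         first = 1, second = 4) {
  stopifnot(n >= 2, length(x) >= n)
  starts <- seq(1, length(x) - n + 1, by = n)
  rows <- lapply(starts, function(s) {
    w <- x[s:(s + n - 1)]
    pairs <- sum(w[-n] == first & w[-1] == second)  # first followed by second
    n1 <- sum(w == first)
    n4 <- sum(w == second)
    # Assumption: BigRatio is NA when the window has no `first` or no `second`
    big <- if (n1 > 0 && n4 > 0) pairs * n / (n1 * n4) else NA_real_
    data.frame(start = s, end = s + n - 1, length = n, pairs = pairs,
               BigRatio = big, SmallRatio = (n1 + n4) / n)
  })
  out <- do.call(rbind, rows)
  # Keep only windows that meet both conditions
  keep <- !is.na(out$BigRatio) & out$BigRatio >= min_big &
    out$SmallRatio >= min_small
  out[keep, ]
}

x <- c(1, 4, 1, 4, 1, 4, 1, 4,  2, 3, 2, 3, 2, 3, 2, 3,  1, 1, 1, 1, 4, 4, 2, 3)
scan_windows(x, n = 8)
```

In this small example the first and third windows are kept (BigRatio 2 and 1), while the middle window has no 1 or 4 at all and is dropped.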
I am still new to programming and R.
Additional question:
How would the function look if it were used on a data frame? Would the search change to specific rows? Is it possible to convert a vector into a data frame?
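As a small illustration of the data frame question, a vector can be turned into a data frame with data.frame(), and rows can then be selected by position. The column names position, value and pair_start below are made up for this sketch, and the short vector is only an example.

```r
x <- c(1, 1, 4, 2, 3, 1, 4)

# One row per element, with its position kept as a column
df <- data.frame(position = seq_along(x), value = x)

# Restrict the search to specific rows, e.g. rows 3 to 7
rows <- df[df$position >= 3 & df$position <= 7, ]

# Flag rows where a 1 is immediately followed by a 4 in the next row;
# the last row has no successor, so it is always FALSE
df$pair_start <- c(head(df$value, -1) == 1 & tail(df$value, -1) == 4, FALSE)

which(df$pair_start)
# 2 6
```

The search itself still works on a plain column (df$value), so a function written for a vector can usually be applied to a data frame column unchanged.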
EDIT:
Another example and clarification:
sample2 <- paste0(
  "aaaaabababababababababababababababcabcbababcbcbabcbcdddcbcbcdcbcbcbdcb",
  "bcbcbcbdbdbcbcbcbccbbcbbcbcbcbcbcbcbabababababababccbbcbbcbcbcbcbcbcbdbdbcbcbcbccb",
  "bcbcbcbdbdbcbcbcbccbbcbbcbcbcbcbcbbababababababababababababacbcbacbcbcdcbcbcbdcbbcdaddabcbac",
  "cabcbabcbabcbcbbabbabababababababababababa")
nchar(sample2)
So this is what it should do:
1) idea
Search the string in parts of 50 characters, which means this part first:
"aaaaabababababababababababababababcabcbababcbcbabc"
and then this part (the next sequence of 50 elements of that string)
"bcbabcbcdddcbcbcdcbcbcbdcbbcbcbcbdbdbcbcbcbccbbcbb"
And do this for every further 50 elements of the string.
As you can see, the second 50 elements have too few "ba" in them to match the condition. So that part will not be shown because it does not meet the condition.
- The next thing would be to check whether the conditions are met (for example, ratios > 0.5) using the formula mentioned above for a certain pattern, which is "ba" in this case. If the ratio for "ba" is > 0.5, then it should print out that sequence, say where it starts, return the ratios and so on. The result should be in a data frame, for example.
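The chunking step described above could be sketched like this. The function name chunk_pattern_share is hypothetical, and "share" is assumed here to mean the fraction of characters in a chunk that are covered by the pattern, which may differ from the formula intended above. The last chunk may be shorter than width.

```r
# Hypothetical sketch: split a string into fixed-width chunks and
# measure how much of each chunk is covered by `pattern`.
chunk_pattern_share <- function(s, width = 50, pattern = "ba") {
  starts <- seq(1, nchar(s), by = width)
  ends   <- pmin(starts + width - 1, nchar(s))
  chunks <- substring(s, starts, ends)
  # gregexpr() returns -1 when there is no match, so count only positive hits
  counts <- vapply(gregexpr(pattern, chunks, fixed = TRUE),
                   function(m) sum(m > 0), integer(1))
  data.frame(start = starts, end = ends, count = counts,
             share = counts * nchar(pattern) / nchar(chunks))
}

res <- chunk_pattern_share("bababababaxxxxxxxxxx", width = 10)
res[res$share > 0.5, ]
```

In this toy string the first chunk of 10 characters is entirely "ba" (share 1) and the second has none, so only the first row passes the > 0.5 filter.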
The next idea was to calculate what the optimal segment for > 0.5 in this string would be. There would be a problem if the first part of 50 elements had 0.4 of "ba" in it, and the next 50 had 0.1 of "ba" right at the beginning of that part. Imagine the first 50 have a lot of "ba" at the end, but not enough:
"aaaaabababababdcdcdcdacacbababababababababababababab"
The next 50 have a lot at the beginning:
"bababababababcbcdcbcbcbdcbbcbcbcbdbdbcbcbcbccbbcbbcd"
So how can this be made more optimal? Should there be a scoring system for "ba", as explained above, to find the optimal lengths of a segment that satisfies the conditions?
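One way to avoid the chunk-boundary problem, short of a full scoring system, might be to slide the window one character at a time instead of jumping by 50. The sketch below is only an illustration: the name sliding_share is hypothetical, and "share" is again assumed to be the fraction of characters in the window covered by the pattern.

```r
# Hypothetical sketch: share of `pattern` in every window of `width`
# characters, moving one character at a time.
sliding_share <- function(s, width = 50, pattern = "ba") {
  stopifnot(nchar(s) >= width, width >= nchar(pattern))
  hits <- gregexpr(pattern, s, fixed = TRUE)[[1]]
  is_start <- logical(nchar(s))
  if (hits[1] > 0) is_start[hits] <- TRUE   # -1 means no match at all
  # A match fits inside a window only if it starts within the first
  # `span` positions of that window
  span <- width - nchar(pattern) + 1
  cs <- c(0, cumsum(is_start))              # cs[k + 1] = matches starting in 1..k
  starts <- seq_len(nchar(s) - width + 1)
  counts <- cs[starts + span] - cs[starts]
  data.frame(start = starts, end = starts + width - 1,
             share = counts * nchar(pattern) / width)
}

res <- sliding_share("xxxxbabababaxxxx", width = 8)
res[which.max(res$share), ]
# start 5, end 12, share 1
```

Overlapping windows that pass the threshold could then be merged into longer segments, which is one possible answer to the question of optimal segment length.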