I've been enjoying the powerful function aregexec, which lets me mine strings in a fuzzy way.

For example, I can search for a nucleotide string "ATGGCTTCGTC" within a DNA section with a defined allowance of insertions, deletions and substitutions.

However, it only shows me the first match and does not scan the rest of the string. For example,

If I run

aregexec("a","adfasdfasdfaa")

only the first "a" will show up from the result. I'd like to see all the matches.

I wonder if there is a more powerful function, or an argument that can be added to this one.

Thank you very much.

P.S. I explained the fuzzy search poorly. I mean that the match doesn't have to be perfect. Say I allow a substitution of one character and search for AATTGG in ctagtactaAATGGGatctgct: the capitalized part will be considered a match. I can similarly allow insertions and deletions of a certain number of characters.

1 Answer

gregexpr reports every occurrence of the pattern in the string, as in this example.

gregexpr("as","adfasdfasdfaa")

There is much more information if you run ?grep in R; it explains every aspect of using regex.
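To pull the positions and the matched text out of the gregexpr result, something like the following should work. It is only a minimal sketch of exact matching, so it does not cover the fuzzy case; the variable names are just for illustration.

```r
x <- "adfasdfasdfaa"

# gregexpr returns a list with one element per input string;
# each element holds the start positions of all matches.
m <- gregexpr("as", x)

# Drop the attributes to get plain start positions.
starts <- as.integer(m[[1]])

# regmatches extracts the matched substrings themselves.
hits <- regmatches(x, m)[[1]]

print(starts)
print(hits)
```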

Gowachin
  • Hi, thank you so much for your answer. The problem with gregexpr is that I can't figure out how to use fuzzy search. For aregexec, an argument max=list(sub=1,del=2,insert=3) allows flexible search. – Field -Ye Tian Feb 20 '20 at 13:59
  • I don't understand what you mean by fuzzy search... what exactly do you want to mine? – Gowachin Feb 20 '20 at 14:32
  • I see. I will edit the question. By fuzzy search I mean the match doesn't have to be perfect. Say I allow a substitution of one character and search for AATTGG in ctagtactaAATGGGatctgct: the capitalized part will be considered a match. – Field -Ye Tian Feb 20 '20 at 14:47
  • Okay, so maybe you need to design a specific regex? For example, this command will match either AATTGG or AATGGG: `regexpr('AAT(T|G)GG','CTAGTACTAAATGGGATCTGCT')`. Do you have priors or regular sequences to match, not depending on what's between them? – Gowachin Feb 20 '20 at 15:18
  • Wow. Thank you so much for your help. Unfortunately, the biology is not that simple. I'd like to search for, say, 20 characters within a DNA string of 2000. Among the 20, each character can be substituted with any of the other 3 nucleotides, and in general I can tolerate 1-2 substitutions. I have a slow fix for aregexec where I delete the match I found and run it again. I wonder how I could suggest this to the author of the aregexec function. – Field -Ye Tian Feb 20 '20 at 15:58
  • Well, I'm a biologist in evolution too, so I get the point about the difficulty ^^ Your case seems difficult to code with regex (when I think about it right now). Also, this function is in the core of `R`, so I doubt you can change it, because many functions/packages could depend on it... Don't you have regular patterns in your sequence? Because 20 characters with all 4 bases have `1.099512e+12` possible combinations... – Gowachin Feb 20 '20 at 16:34
  • Thank you so much for your time and patience. I will try out my labor-intensive method. Also, I wonder if there is any way to notify the authors of the R language and/or aregexec, to suggest a function like aregexec2. – Field -Ye Tian Feb 21 '20 at 02:10
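
For the substitution-only case described in the comments (a short pattern, a tolerance of 1-2 substitutions), one possible base-R workaround is to slide a window along the sequence and count mismatches at each position. This is a swapped-in technique, not how aregexec works internally, and it does not handle insertions or deletions. The helper name find_all_mismatch and its arguments are hypothetical, and the comparison is made case-insensitive here as an assumption.

```r
# Hypothetical helper: report every window of `text` that differs from
# `pattern` in at most `max_sub` positions (substitutions only).
find_all_mismatch <- function(pattern, text, max_sub = 1L) {
  # Compare on upper-cased code points (assumption: case is irrelevant).
  p <- utf8ToInt(toupper(pattern))
  s <- utf8ToInt(toupper(text))
  n <- length(p)
  empty <- data.frame(start = integer(0), match = character(0),
                      mismatches = integer(0), stringsAsFactors = FALSE)
  if (n == 0L || length(s) < n) return(empty)

  # Count mismatching positions for every possible window start.
  starts <- seq_len(length(s) - n + 1L)
  mism <- vapply(starts,
                 function(i) sum(s[i:(i + n - 1L)] != p),
                 integer(1))

  keep <- starts[mism <= max_sub]
  data.frame(
    start = keep,
    # Return the matched text in its original case.
    match = vapply(keep, function(i) substr(text, i, i + n - 1L),
                   character(1)),
    mismatches = mism[mism <= max_sub],
    stringsAsFactors = FALSE
  )
}

res <- find_all_mismatch("AATTGG", "ctagtactaAATGGGatctgct", max_sub = 1L)
print(res)

exact <- find_all_mismatch("a", "adfasdfasdfaa", max_sub = 0L)
print(exact$start)
```

Overlapping windows are all reported, so a single region can appear more than once. A 20-character pattern against a 2000-character sequence needs fewer than 2000 window comparisons, which should be fast enough without deleting matches and rerunning.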