How to use regular expressions in awk or sed to find all homopolymers in a DNA sequence?

Question

Background

A homopolymer is a sub-sequence of DNA made of consecutive identical bases, like AAAAAAA. Example in Python to extract them:

import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']

My effort

I wrote a gawk script that solves the problem, but without using regular expressions:

echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
  FS=""
}
{
  homopolymer = $1;
  base = $1;
  for(i=2; i<=NF; i++){
    if($i == base){
      homopolymer = homopolymer""base;
    }else{
      print homopolymer;
      homopolymer = $i;
      base = $i;
    }
  }
  print homopolymer;
}'

Output

A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

Question

How can I use regular expressions in awk or sed to get the same result?

anubhava · Accepted Answer · 2015-05-25T16:19:24.483

6

grep -o will get you that in one line:

echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

Explanation:

([A-Z])   # matches and captures a letter in matched group #1
\1*       # matches 0 or more of captured group #1 using back-reference \1
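
For comparison with the Python snippet in the question, here is a minimal sketch of the same back-reference pattern using Python's re module. It uses re.finditer rather than re.findall, because with a capturing group findall would return only the captured first letter of each run. The variable names are illustrative.

```python
import re

DNA = "ACCCGGGTTTAACCGGACCCAA"

# ([A-Z]) captures one letter; \1* then matches zero or more repeats of
# that same letter, so each match is one maximal run of identical bases.
pattern = re.compile(r"([A-Z])\1*", re.IGNORECASE)

# group(0) is the whole match (the run); group(1) would be only its first letter.
homopolymers = [m.group(0) for m in pattern.finditer(DNA)]
print(homopolymers)
```

Unlike 'A+|T+|C+|G+', this pattern is not tied to the four DNA letters, so it would also split runs of N or any other letter.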

sed is not the best tool for this but since OP has asked for it:

echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

PS: This requires GNU sed (the -r option and \n in the replacement are GNU extensions).
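
The substitution idea (append a newline after every run) can be sketched in Python as well, which may help in seeing what the sed command does. This is only an illustration, not part of the sed solution; the name `separated` is made up for the example.

```python
import re

DNA = "ACCCGGGTTTAACCGGACCCAA"

# \g<0> plays the role of sed's &: it stands for the whole matched run.
# The \n in the replacement template becomes a newline after each run.
separated = re.sub(r"([A-Z])\1*", r"\g<0>\n", DNA)
print(separated, end="")
```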

edited May 25 '15 at 16:19

answered May 25 '15 at 16:07


yes `echo "ACCCGGGTTTAACCGGACCCAA" | grep -oE 'A+|T+|C+|G+'` works well, but, I don't know how to do it with awk or sed – Jose Ricardo Bustos M. May 25 '15 at 16:10

@Jose `grep` is the right tool for the job in this case. It is unclear why you want to use sed or awk. – Tom Fenech May 25 '15 at 16:12
@JoseRicardoBustosM. `grep` is best suited for this but I've provided a `sed` solution also. – anubhava May 25 '15 at 16:15
@TomFenech you're right , I'm just learning how to use awk and sed .... and I could not do this – Jose Ricardo Bustos M. May 25 '15 at 16:15
Yes that's right. `&` is back-reference to whatever we matched in pattern and `\n` is adding a new line after match. – anubhava May 25 '15 at 16:19

Alejandro Teixeira Muñoz · Answer 2 · 2015-05-25T16:20:07.190

1

Try using split and just comparing.

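echo "ACCCGGGTTTAACCGGACCCAA" | awk '{
  split($0, chars, "")   # empty separator: one array element per character
  for (i = 1; i <= length($0); i++) {
    if (chars[i] != chars[i+1]) {
      # last base of a run: print it and end the line
      printf("%s\n", chars[i])
    } else {
      # the run continues: print the base without a newline
      printf("%s", chars[i])
    }
  }
}'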

A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

EXPLANATION

split divides the one-line string passed to awk and stores each character in the array chars[]. We then walk through the array and compare each character with the next one: if (chars[i]!=chars[i+1]). If they are equal, we print the character without a newline and move on to the next one. If they differ, we print the character followed by \n, a newline, which ends the current homopolymer.
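
The compare-with-the-next-character logic described above can be sketched in Python as follows. This is a minimal illustration, not a translation the answer's author provided; the function name `split_runs` is hypothetical.

```python
def split_runs(seq):
    """Split seq into runs of identical characters by comparing each
    character with the one that follows it."""
    runs = []
    current = ""
    for i, ch in enumerate(seq):
        current += ch
        # None stands in for "no next character" at the end of the string.
        nxt = seq[i + 1] if i + 1 < len(seq) else None
        if ch != nxt:
            # The run ends here: store it and start a new one.
            runs.append(current)
            current = ""
    return runs


for run in split_runs("ACCCGGGTTTAACCGGACCCAA"):
    print(run)
```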

edited May 25 '15 at 16:20

answered May 25 '15 at 16:14


my problem is using regular expressions in awk, but thank you very much – Jose Ricardo Bustos M. May 25 '15 at 16:23
oh i understood u wanted without them!! let me then a while!! :P – Alejandro Teixeira Muñoz May 25 '15 at 16:25
