Background

A homopolymer is a subsequence of DNA made of consecutive identical bases, such as AAAAAAA. Here is a Python example that extracts them:

import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']

My effort

I wrote a gawk script that solves the problem, but without using regular expressions:

echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
  FS=""
}
{
  homopolymer = $1;
  base = $1;
  for(i=2; i<=NF; i++){
    if($i == base){
      homopolymer = homopolymer""base;
    }else{
      print homopolymer;
      homopolymer = $i;
      base = $i;
    }
  }
  print homopolymer;
}'

Output

A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

Question

How can I use regular expressions in awk or sed to get the same result?

Jose Ricardo Bustos M.

2 Answers

grep -o will get you that in one line:

echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

Explanation:

([A-Z])   # matches and captures a letter in matched group #1
\1*       # matches 0 or more of captured group #1 using back-reference \1
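For comparison with the Python snippet in the question, the same back-reference pattern can be sketched with the re module. This is only an illustration, not part of the grep answer; the variable names dna and runs are made up for the example. Note that re.findall would return just the captured letter here, so the sketch uses re.finditer and takes the whole match.

```python
import re

dna = "ACCCGGGTTTAACCGGACCCAA"

# ([A-Z]) captures one letter; \1* matches zero or more repeats of that letter.
# re.findall would return only the captured group (one letter per run),
# so finditer with group(0) is used to recover each full run.
runs = [m.group(0) for m in re.finditer(r"([A-Z])\1*", dna, flags=re.IGNORECASE)]
print(runs)
```

re.IGNORECASE plays the role of grep's -i flag, and taking group(0) corresponds to -o printing only the matched text.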

sed is not the best tool for this, but since the OP asked for it:

echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

PS: This requires GNU sed.
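The substitution idea (append a newline after every run) can also be sketched in Python, assuming the same input string. Here \g<0> stands for the whole match, much like & in the sed replacement; the names dna, separated and lines are illustrative only.

```python
import re

dna = "ACCCGGGTTTAACCGGACCCAA"

# Replace each run with itself followed by a newline.
# \g<0> refers to the entire match, similar to & in sed.
separated = re.sub(r"([A-Z])\1*", r"\g<0>\n", dna)

# The last run also gets a newline appended, so the string ends with "\n";
# splitlines() drops that trailing line break.
lines = separated.splitlines()
print(lines)
```

As with the sed command, the newline is added after the final run too, so the raw result ends with an extra line break.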

anubhava

Try using split and just comparing.

echo "ACCCGGGTTTAACCGGACCCAA" | awk '{
  split($0, chars, "")
  for (i = 1; i <= length($0); i++) {
    if (chars[i] != chars[i+1]) {
      printf("%s\n", chars[i])
    } else {
      printf("%s", chars[i])
    }
  }
}'

A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA

EXPLANATION

split divides the one-line string sent to awk and stores each character in the array chars[]. We then walk through the whole array and compare each character with the next one, using the test chars[i] != chars[i+1]. If the two are equal, we print the character without a newline and move on to the next one. If the next one is different, the current run has ended, so we print the character followed by \n, which is a newline.
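The same compare-with-the-next-character logic can be sketched in Python, the language used in the question. The function name split_runs is hypothetical and chosen only for this illustration; it collects the runs in a list instead of printing them.

```python
def split_runs(sequence):
    """Group consecutive identical characters by comparing each with the next."""
    runs = []
    current = ""
    for i, char in enumerate(sequence):
        current += char
        # Look one position ahead; past the end there is no next character.
        next_char = sequence[i + 1] if i + 1 < len(sequence) else None
        if char != next_char:
            # The run ends here, so store it and start a new one.
            runs.append(current)
            current = ""
    return runs


print(split_runs("ACCCGGGTTTAACCGGACCCAA"))
```

The explicit bounds check replaces awk's behavior of treating the missing element chars[i+1] as an empty string at the end of the input.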