0

I have 36-nt reads like this: atcttgttcaatggccgatcXXXXgtcgacaatcaa in the fastq file with XXXX being the different barcodes. I want to search for a barcode in the file at exact position(21 to 24) and print the sequences with up to 3 mismatches in sequence not barcode.

For example: I have barcode: aacg search that barcode between position 21 to 24 in fastq file with allowing 3 mismatches in the sequence like:

atcttgttcaatggccgatcaacggtcgacaatcac # it has 1 mismatch
ttcttgttcaatggccgatcaacggtcgacaatcac # it has 2 mismatch
tccttgttcaatggccgatcaacggtcgacaatcac # it has 3 mismatch

I was trying to find unique lines first using awk and look for mismatches but it is very tedious for me to look and find them.

awk 'NR%4==2' 1.fq |sort|uniq -c|awk '{print $1"\t"$2}' > out1.txt

Is there any quick way i can find?

Thank you.

Leandro Papasidero
  • 3,728
  • 1
  • 18
  • 33
abh
  • 101
  • 1
  • 4
  • 13
  • I'm confused. What do barcodes have to do with nucleotide sequences? – Kevin Apr 17 '13 at 19:14
  • initially i was looking for barcodes for specific position and i was getting very low count,and with 1 mismatch in sequence i got high count.so,if i give mismatches in the sequence i will get more sequences(and i want to try upto 3) – abh Apr 17 '13 at 19:18
  • So you're scanning [barcodes](http://en.wikipedia.org/wiki/Barcode)? Like, the black and white stripey patterns that supermarket cashiers use to identify the price of items? Because I still have no idea how you can get DNA from a barcode. – Kevin Apr 17 '13 at 19:25
  • 1
    @kevin http://en.wikipedia.org/wiki/DNA_barcoding – danfuzz Apr 17 '13 at 21:29

3 Answers3

1

Using Python:

strs = "atcttgttcaatggccgatcaacggtcgacaatcaa"

with open("1.fq") as f:
    for line in f:
        if line[20:24] == "aacg":
            line = line.strip()
            mismatches = sum(x!=y for x, y  in zip(strs, line))
            if mismatches <= 3:
                print line, mismatches

atcttgttcaatggccgatcaacggtcgacaatcac 1
ttcttgttcaatggccgatcaacggtcgacaatcac 2
tccttgttcaatggccgatcaacggtcgacaatcac 3
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • ATCTTGTTCAATGGCCGATCAACGGTCGACAATCAA ATCTTGTTCAATGGCCGATCAACGCTCGACAATCAA 3 AGCTTGTTCAATGGCCGATCAACGCTCGACAATCAA 2 i'm confused with output here is some lines,in this the first one is original sequence and the number 3 has 1 mismatch and 2 has 2 mismatches.what i want is it should print similar sequence(say 0) and 1,2,3 mismatch sequences @Ashwini Chaudhary – abh Apr 17 '13 at 21:22
0

Using Python:

import re
seq="atcttgttcaatggccgatcaacggtcgacaatcaa"
D = [ c for c in seq ]
with open("input") as f:
    for line in f:
        line=line.rstrip('\n')
        if re.match(".{20}aacg", line):
            cnt = sum([ 1 for c,d in zip(line,D) if c != d])
            if cnt < 4:
                print cnt, line
perreal
  • 94,503
  • 21
  • 155
  • 181
  • i just looked at some output 1 ATCTTGTTCAATGGCCGATCAACGGCCGACAATCAA 2 ATCTTGTTCAATGGCCGATCAACGGTCGACAATCAA but the second one is similar to sequence and it is saying me 2 mismatches why? – abh Apr 17 '13 at 20:34
  • there must be 2 mismatches in the sequence – perreal Apr 17 '13 at 21:04
  • original ATCTTGTTCAATGGCCGATCAACGGTCGACAATCAA 0 AGCTTGTTCAATGGCCGATCAACGGCCGACAATCAA 1 AGCTTGTTCAATGGCCGATCAACGGTCGACAATCAA 2 ATCTTGTTCAATGGCCGATCAACGGTCGACAATCAA 3 ATCTTGATCAATGGCCGATCAACGGTCGACAATCAA here are the some of the lines i got @perreal – abh Apr 17 '13 at 21:29
  • 0, 2, 1 ,0, 1 is what I get for that data – perreal Apr 17 '13 at 21:33
  • yes.i got the same thing.i just want like if it find similar then 0,1 mismatch sequence with 1,2 mismatch sequence with 2 and 3 mismatch sequence with 3 @perreal – abh Apr 17 '13 at 21:37
  • sorry I don't really understand what you mean. Can you try to explain it again? – perreal Apr 17 '13 at 23:01
0

Using python regex module allows you to specify the number of mismatches

import regex #intended as a replacement for re
from Bio import SeqIO
import collections

d = collections.defaultdict(list)
motif = r'((atcttgttcaatggccgatc)(....)(gtcgacaatcaa)){e<4}' #e<4 = less than 4 errors
records = list(SeqIO.parse(open(infile), "fastq"))
for record in records:
    seq = str(record.seq)
    match = regex.search(motif, seq, regex.BESTMATCH)
    barcode = match.group(3)
    sequence = match.group(0)
    d[barcode].append(sequence) # store as a dictionary key = barcode, value = list of sequences
for k, v in d.items():
    print("barcode = %s" % (k))
    for i in v:
        print("sequence = %s" % (i))

using capture groups, the fourth group (3), will be the barcode

Colin Anthony
  • 1,141
  • 12
  • 21