How to remove duplicates from fasta file but keep at least one per group based on header

Question

I have a multifasta file that looks like this:

( all sequences are >100bp, more than one line, and same lenght )

>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage3_samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage3_samplenameD
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA

I need to remove the duplicates BUT keep at least on sequence per lineage. So in this simple example (Notice samplenameA,C and D are identical) above I would want to remove only samplenameD or samplenameC but not both of them. In the end I want to get the same header information as in the original file.

Example output:

>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage3_samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA

I found out a way that works to remove just the duplicates. Thanks to Pierre Lindenbaum.

sed -e '/^>/s/$/@/' -e 's/^>/#/'
file.fasta  |\
tr -d '\n' | tr "#" "\n" | tr "@"
"\t" |\
sort -u -t '  ' -f -k 2,2  |\
sed -e 's/^/>/' -e 's/\t/\n/'

Running this on my example above would result in:

>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG

—> so losing the lineage 3 sequence

Now I’m just looking for a quick solution to remove duplicates but keep at least one sequence per lineage based on the fasta header.

I’m new to scripting... any ideas in bash/python/R are welcome.

Thanks!!!

Hi, generally what you wrote should work. Could you be more specific about the solution you need? If the bash solution you provided is good enough, feel free to use the "Answer your own question" button. This will help others in the future to find you solution :) For future questions you may want to avoid the specific domain jargon, if possible. It may make it more difficult for other to help. Good luck and welcome! :) — Ruslan, Jul 25 '20 at 20:31
@Ruslan Thanks but the solution I provided removes all duplicate sequences from the fasta file. I do not want that. I want to remove the duplicates keeping at least one per specific lineage. It is possible that a whole lineage is lost when removing duplicates because it consists of the same sequences as some samples from the other lineages. This way I lose a lot of valuable information. As I said in my question/example above I don’t want to remove both C and D because I need to keep one for lineage 3 for example. I really dont see how else I can explain.. hope it’s more clear now. :) — Xela Vi, Jul 25 '20 at 20:39
Can you edit the post to provide for the example desired output? I can think of several solutions, but it really depends on the nature of your solution: can your FASTA be multiple-lines? Do you need to retain any lineage information, or would the unique sequences suffice? etc. — Ruslan, Jul 25 '20 at 20:46

Ruslan · Accepted Answer · 2020-07-25T21:37:35.467

In this case I can see two relatively good alternatives. A) look into existing tools (such as the Biopython library, or FASTX toolkit. I think both of them have good commands to do most of the work here, so it may be worthwhile to learn them. Or, B) write your own. In this case you may want to try (I'll stick to python):

loop over the file, line-by-line, and add the lineage/sequence data to a dictionary. I suggest using the sequence as a key. This way, you can easily know if you already encountered this key.

myfasta = {}
if myfasta[sequence]:
    myfasta[sequence].append(lineage_id)
else:
    myfasta[sequence] = [lineage_id]

This way your key (sequence) will hold the list of lineage_ids that have the same sequence. Note that the annoying bits of this solution will be to loop over the file, separate lineage-id from sequence, account for sequences that may extend to multiple lines, etc.

After that, you can loop over the dictionary, and write the sequences to file by using only the first lineage_id from the list within the dictionary.

Hi @Xela Vi, if you found my answer useful please consider accepting the answer (by clicking on the "V" next to the post. Good luck! — Ruslan, Jul 30 '20 at 11:22

How to remove duplicates from fasta file but keep at least one per group based on header

1 Answers1