
I apologize if the question has been asked before, but I have been searching for days and could not find a solution in Python.

I have a large fasta file, containing headers and sequences.

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

>cavPor3_rmsk_tRNA-Ser-TCA(m) range=chrM:6875-6940 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

This is a very small fragment of what the file looks like. As you can see, the last two entries have identical sequences; in that case I want to keep only the first entry (header and sequence).

The output would look like this:

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

The problem is that the FASTA file is over one gigabyte in size. I have found ways of solving this by removing duplicates based on duplicate IDs, or by using bash, but sadly I can't do that on my computer. This task is for a research project, not homework.

Thank you in advance for your help!

  • shouldn't a fasta file be like: >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL >SEQUENCE_2 SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH – pippo1980 Mar 03 '21 at 18:59
  • 1
    https://bioinformatics.stackexchange.com/questions/13708/remove-redundant-sequences-from-fasta-file-in-python – pippo1980 Mar 03 '21 at 19:02
  • so you should read the first sequence and append it to a new file, then read the second sequence, check the new file for it, and if it's not present append it too, and so on until the end of the first file – pippo1980 Mar 03 '21 at 19:27
  • https://www.biostars.org/p/250123/ Question: Best way to processing huge fasta files using python. – pippo1980 Mar 03 '21 at 19:35
  • https://www.biostars.org/p/710/ : Question: Correct Way To Parse A Fasta File In Python – pippo1980 Mar 03 '21 at 20:10
  • according to the above, try to get the size of the ''fasta_sequences = SeqIO.parse(open(input_file),'fasta')'' object and see how big it is for your GB file; then create it just once, and append sequence 1 to a new file in a loop that recreates the parser of the output file each time, to check whether the nth sequence of the input file should be appended to the output file. Am I wrong? The logic seems OK to me. Not sure it will work. And there is the `>` problem above all – pippo1980 Mar 03 '21 at 20:12
  • out of curiosity, it's not my field: would a swap file help with a huge file? I mean, can a system put parts of an active object (or whatever it's called) into a swap file instead of memory? I'm not interested in the system being slow as a snail, I just want to be sure it can be done – pippo1980 Mar 03 '21 at 20:52
  • https://biopython.org/wiki/Split_large_file – pippo1980 Mar 05 '21 at 23:16
  • consider using a database ? not sure if can work but seems to be out there https://www.tutorialspoint.com/biopython/biopython_biosql_module.htm – pippo1980 Mar 08 '21 at 09:36
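The database idea in the last comment can be sketched with Python's built-in sqlite3 module instead of BioSQL: a UNIQUE constraint on the sequence column makes the database reject duplicates on disk rather than in RAM. This is only a sketch; the table layout, file name, and `add_record` helper are made up for illustration.

```python
import sqlite3

# the table lives on disk, not in RAM; the UNIQUE constraint on seq
# makes the database reject duplicate sequences for us
con = sqlite3.connect("records.db")
con.execute("CREATE TABLE IF NOT EXISTS fasta (header TEXT, seq TEXT UNIQUE)")

def add_record(header, seq):
    """Insert a record; INSERT OR IGNORE silently skips duplicate sequences."""
    with con:
        con.execute("INSERT OR IGNORE INTO fasta VALUES (?, ?)", (header, seq))

add_record(">first", "AGCT")
add_record(">second", "AGCT")  # same sequence: ignored, ">first" is kept
```

The deduplicated records could then be streamed back out with a plain `SELECT header, seq FROM fasta`.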

3 Answers


This is copied from here: Remove Redundant Sequences from FASTA file in Python

It uses Biopython, and works with FASTA files where the headers are of the

`>header` type; see the FASTA format wiki.

from Bio import SeqIO
import time

start = time.time()

seen = []     # sequences already encountered (a list: membership tests are O(n))
records = []  # first record seen for each unique sequence

for record in SeqIO.parse("INPUT-FILE", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# writing the deduplicated records to a fasta file
SeqIO.write(records, "OUTPUT-FILE", "fasta")
end = time.time()

print(f"Run time is {(end - start)/60} minutes")


Faster, as suggested by MattDMo, using a set instead of a list:

seen = set()
records = []

for record in SeqIO.parse("INPUT-FILE", "fasta"):
    if str(record.seq) not in seen:   # set membership tests are O(1)
        seen.add(str(record.seq))
        records.append(record)

I've got a longer version that uses argparse, but it's slower because it counts the sequences; I can post it if needed.

pippo1980
  • The OP's FASTA entries are properly formatted, it just wasn't showing up due to the `>` character indicating a blockquote in Markdown. I have reformatted the original post. – MattDMo Mar 04 '21 at 01:17
  • Regarding your code, for large files it would be *much* more efficient if `seen` was a set instead of a list. – MattDMo Mar 04 '21 at 01:20
  • @MattDMo and how will it cope with the GB-size file problem? – pippo1980 Mar 04 '21 at 07:16
  • It depends on how much memory the machine has, but any reasonably recent machine should have at least 2-4GB of RAM, so there shouldn't be a problem here. In general, though, if the size of a data structure exceeds available memory, you'll get a `MemoryError`. – MattDMo Mar 04 '21 at 16:28
  • @MattDMo https://www.biostars.org/p/710/: Correct Way To Parse A Fasta File In Python. According to this, would my idea be feasible (don't care if it's slow as hell): in case of not enough memory, create ''fasta_sequences = SeqIO.parse(open(input_file),'fasta')'' just once, and append sequence 1 to a new file in a loop that recreates the parser of the output file each time, to check whether the nth sequence of the input file should be appended to the output file. Am I wrong? Where can I get a 2GB fasta file to try it out? Most sequence repos have size limits, even for blast alignment results – pippo1980 Mar 04 '21 at 17:27
  • @MattDMo Do you have time to have a look at https://bioinformatics.stackexchange.com/questions/15494/remove-redundant-sequences-from-fasta-file-with-biopython-reducing-memory-footpr ? I am on no more questions ban on SO. Cheers P. – pippo1980 Mar 07 '21 at 18:04

If the file alternates header lines (starting with `>`) and sequence lines as shown, you can group each header with its sequence lines and keep only the first record for each unique sequence. (Splitting the header into its six space-separated fields does not help here: the last field, `repeatMasking=none`, is identical on every header.) You will have to store all of the unique sequences in memory.

memory = set()
record = []   # current record: header line followed by its sequence lines

def flush(record):
    # emit the record only if its (joined) sequence has not been seen yet
    seq = ''.join(record[1:])
    if record and seq not in memory:
        memory.add(seq)
        print('\n'.join(record))

for line in open('x.txt'):
    line = line.rstrip()
    if line.startswith('>'):
        flush(record)
        record = [line]
    elif line and record:
        record.append(line)
flush(record)
Tim Roberts

If you want to keep all the headers while removing duplicate sequences, you can use:

from Bio import SeqIO

input1 = open("fasta.fasta")
dict_fasta = {record.id: str(record.seq) for record in SeqIO.parse(input1, "fasta")}

# map each unique sequence to the list of headers that share it
tmp = {}
for key, value in dict_fasta.items():
    if value in tmp:
        tmp[value].append(key)
    else:
        tmp[value] = [key]