
I have a large document (more than 20,000 lines) with this structure:

@A00627:308:H227VDSX3:1:1201:30734:26349 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFF:F:FFFFFFFFFFFF
@A00627:308:H227VDSX3:1:1257:18828:34695 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:308:H227VDSX3:1:1266:28809:10300 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGAAACCCACTGGGTGCCCG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:,F:FFFFFFF
@A00627:308:H227VDSX3:1:1447:29315:13745 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT
+

I want to keep only the lines starting with @ and the line that follows each of them, like this:

    @A00627:308:H227VDSX3:1:1201:30734:26349 2:N:0:TGGCAGTA+GTACAGTG
    CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT
   
    
    @A00627:308:H227VDSX3:1:1257:18828:34695 2:N:0:TGGCAGTA+GTACAGTG
    CTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGA
    
    
    @A00627:308:H227VDSX3:1:1266:28809:10300 2:N:0:TGGCAGTA+GTACAGTG
    CTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGAAACCCACTGGGTGCCCG
    
    
    @A00627:308:H227VDSX3:1:1447:29315:13745 2:N:0:TGGCAGTA+GTACAGTG
    CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT

I have tried this code:

import fileinput
from collections import deque
output_file = 'cola1_fasta.txt' 
buscado = '@'

contexto = deque([], 3)  # keeps only the last 3 lines


with open(output_file, "w") as f_out:
    for line in fileinput.input(files=["cola1.txt"]):
        contexto.append(line)       
        if len(contexto) < 3:      
            continue
        if buscado in contexto[1]:  
            f_out.writelines(contexto) 

But I can't obtain this output. Do you have any suggestions? Many thanks!

Sebastian

2 Answers


Loop over the input file line by line and check whether each line starts with @. If it does, write that line to the output file and set the header_row flag to True, so that the next iteration knows to write the following line as well. Every other line is replaced by a blank line.

input_filename = r"cola1.txt"
output_filename = r"cola1_fasta.txt"

header_row = False
with open(input_filename) as in_f:
    with open(output_filename, "wt") as out_f:
        for line in in_f:
            if line.startswith("@"):
                out_f.write(line)
                header_row = True
            elif header_row:
                out_f.write(line)
                header_row = False
            else:
                out_f.write("\n")

cola1_fasta.txt:

@A00627:308:H227VDSX3:1:1201:30734:26349 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT


@A00627:308:H227VDSX3:1:1257:18828:34695 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGA


@A00627:308:H227VDSX3:1:1266:28809:10300 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGAAACCCACTGGGTGCCCG


@A00627:308:H227VDSX3:1:1447:29315:13745 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT

Note that this implementation leaves 2 extra blank lines at the bottom of the output file, because the last record's two discarded lines are also replaced by newlines.
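
If the trailing blank lines are unwanted, one option is to hold the blank lines back and write them only once another header turns up. This is a minimal sketch, assuming the same startswith("@") header test as above. The function name keep_headers_and_sequences and the in-memory sample records are made up for illustration:

```python
import io

def keep_headers_and_sequences(in_f, out_f):
    """Copy each '@' header and the line after it. Other lines become
    blank lines, written only when another header follows them."""
    take_next = False    # True right after a header line
    pending_blanks = 0   # blank separators not yet written
    for line in in_f:
        if line.startswith("@"):
            out_f.write("\n" * pending_blanks)
            pending_blanks = 0
            out_f.write(line)
            take_next = True
        elif take_next:
            out_f.write(line)
            take_next = False
        else:
            pending_blanks += 1

# Hypothetical two-record sample, held in memory instead of a file.
blank_demo_in = io.StringIO("@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nFFFF\n")
blank_demo_out = io.StringIO()
keep_headers_and_sequences(blank_demo_in, blank_demo_out)
print(blank_demo_out.getvalue())
```

The function accepts any pair of file-like objects, so it should also work with the two handles opened in the with statements above.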

Wai Ha Lee
GordonAitchJay

Take advantage of the fact that files are iterators in Python. Loop over the file line by line and check whether the line starts with @. If it does, write that line and the following one (fetched with next) to the output file; otherwise write a blank line:

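input_file = 'cola1.txt'
output_file = 'cola1_fasta.txt'

with open(output_file, 'w') as out_file, open(input_file) as in_file:
    for line in in_file:
        if line.startswith('@'):
            out_file.write(line)
            # The line after a header is the sequence; '' avoids StopIteration at end of file.
            out_file.write(next(in_file, ''))
        else:
            out_file.write('\n')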
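
One caveat with any startswith('@') test: if this is FASTQ data, as the headers suggest, a quality line can itself begin with @, since @ is a valid quality character, so the test may occasionally pick up the wrong line. A rough alternative, assuming the file consists of complete four-line records (header, sequence, +, quality), is to read the lines in groups of four and keep the first two of each group. The helper names iter_records and write_header_and_sequence and the sample data are hypothetical:

```python
import io
from itertools import islice

def iter_records(handle, size=4):
    """Yield consecutive groups of `size` lines; the last group may be shorter."""
    while True:
        record = list(islice(handle, size))
        if not record:
            return
        yield record

def write_header_and_sequence(in_f, out_f):
    """Keep the first two lines of every record and blank out the rest."""
    for record in iter_records(in_f):
        out_f.writelines(record[:2])                    # header + sequence
        out_f.write("\n" * max(len(record) - 2, 0))     # one blank per dropped line

# Hypothetical sample; the first quality line deliberately starts with '@'.
group_demo_in = io.StringIO("@r1\nACGT\n+\n@FFF\n@r2\nTTGA\n+\nFFFF\n")
group_demo_out = io.StringIO()
write_header_and_sequence(group_demo_in, group_demo_out)
print(group_demo_out.getvalue())
```

This relies purely on position, so it only makes sense if every record really has four lines; a truncated or wrapped record would shift everything after it.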
Tomerikoo