Separating a file by lines in Python

Question

I have a .fastq file (cannot use Biopython) that consists of multiple samples in different lines. The file contents look like this:

@sample1
ACGTC.....
+
IIIIDDDDDFF
@sample2
AGCGC....
+
IIIIIDFDFD
.
.
.
@sampleX
ACATAG
+
IIIIIDDDFFF

I want to take the file and separate out each individual sample (i.e. lines 1-4, 5-8 and so on until the end of the file) and write each of them to a separate file (i.e. sample1.fastq contains the contents of sample 1, lines 1-4, and so on). Is this doable using loops in Python?

https://stackoverflow.com/questions/20580657/how-to-read-a-fasta-file-in-python — Boris Verkhovskiy, May 13 '20 at 19:32
You could read and/or copy/paste the source code of the FASTA parser from Biopython https://github.com/biopython/biopython/blob/301498dbdfa413cb14891e7a904d9635a63237b5/Bio/SeqIO/FastaIO.py#L188 — Boris Verkhovskiy, May 13 '20 at 19:35
I am not allowed to use Biopython. I will try to modify the code from the Stack Overflow link you provided. — exracon, May 13 '20 at 19:48

ItsDrike · Answer 1 · 2020-05-13T19:59:20.400

You can use a defaultdict and a regex for this:

import re
from collections import defaultdict

# Get file contents
with open("test.fastq", "r") as f:
    content = f.read()

samples = defaultdict(list) # Make defaultdict of empty lists
identifier = ""

# Iterate through every line in file
for line in content.splitlines():
    # Find lines which start with @
    # (caution: a FASTQ quality line can also start with @)
    if re.match("^@.*", line):
        # Set identifier to match following lines to this section
        # (strip only the leading @, not every @ in the line)
        identifier = line[1:]
    else:
        # Add the line to its identifier
        samples[identifier].append(line)

Now all you have to do is save the contents of this default dictionary into multiple files:

# Loop through all samples (and their contents)
for sample_name, sample_items in samples.items():
    # Create new file with the name of its sample_name.fastq
    # (You might want to change the naming)
    with open(f"{sample_name}.fastq", "w") as f:
        # Write each element of sample_items on its own line,
        # ending the file with exactly one newline
        f.write("\n".join(sample_items).rstrip("\n") + "\n")

It might be helpful to also include @sample_name as the first line of each file, but I'm not sure you want that, so I haven't added it.
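
If the header line is wanted, one way to keep it is to write it back before the other lines. This is only a minimal sketch, assuming `samples` has the same shape as the dictionary built above (name mapped to a list of lines); the function name `write_samples` and the `out_dir` parameter are made up for illustration:

```python
import os
import tempfile


def write_samples(samples, out_dir="."):
    """Write each sample to <out_dir>/<name>.fastq, header line included."""
    paths = []
    for sample_name, sample_items in samples.items():
        path = os.path.join(out_dir, f"{sample_name}.fastq")
        with open(path, "w") as f:
            # Restore the "@name" header that was stripped while parsing
            f.write(f"@{sample_name}\n")
            # Write the remaining lines, ending the file with a newline
            f.write("\n".join(sample_items) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical demo data; the real dictionary comes from the parsing loop
    demo = {"sample1": ["ACGTC", "+", "IIIID"]}
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_samples(demo, tmp):
            with open(path) as f:
                print(f.read(), end="")
```

Returning the list of written paths is optional; it just makes it easy to check which files were created.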

Note that you can adjust the regex to match only @sample[number] instead of every line starting with @. If you want that, use re.match(r"^@sample\d+", line) instead.
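
Since the question describes the records as fixed groups of four lines, another option is to group by position rather than by the leading @. That also sidesteps the fact that a FASTQ quality line may itself begin with @. A rough sketch, assuming every record is exactly four lines (no wrapped sequences) and that the first word of each header is usable as a file name; `split_fastq` and its parameters are hypothetical names, not part of any library:

```python
import os
from itertools import islice


def split_fastq(path, out_dir="."):
    """Write each four-line record of `path` to its own file in `out_dir`."""
    written = []
    with open(path) as src:
        while True:
            # Take the next four lines: header, sequence, "+", quality
            record = list(islice(src, 4))
            header = record[0].strip() if record else ""
            if not header:
                break  # end of file (or a trailing blank line)
            if len(record) < 4 or not header.startswith("@") or len(header) == 1:
                raise ValueError(f"malformed record: {record!r}")
            # First word of the header, minus the leading "@", names the file
            name = header[1:].split()[0]
            out_path = os.path.join(out_dir, f"{name}.fastq")
            with open(out_path, "w") as dst:
                dst.writelines(record)  # lines still carry their newlines
            written.append(out_path)
    return written
```

Calling `split_fastq("test.fastq")` would then write sample1.fastq, sample2.fastq and so on into the current directory, each containing its full four-line record.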
