How do I write a for-loop so a program reiterates itself for a set of 94 DNA samples?

Question

I have written some code in a bash shell (so I can submit it to my university's supercomputer) to edit out contaminant sequences from a batch of DNA extracts I have. Essentially what this code does is take the sequences from the negative extraction blank I did (A1-BLANK) and subtract it from all of the other samples.

I have figured out how to get this to work with individual samples, but I'm attempting to write a for loop so that the little chunks of code will reiterate themselves for each sample, with the outcome of this file being a .sam file with a unique name for each sample where both the forward and reverse reads for the sample are merged and edited for contamination.I have checked stack overflow extensively for help with this specific problem, but haven't been able to apply related answered questions to my code.

Here's an example of part what I'm trying to do for an individual sample, named F10-61C-3-V4_S78_L001_R1_001.fastq:

bowtie2 -q --end-to-end --very-sensitive \ ##bowtie2 is a program that examines sequence similarity compared to a standard
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \ ##This line compares the sample to the negative extraction blank
-1  /file directory/F10-61C-3-V4_S78_L001_R1_001.fastq 
-2 /file directory/F10-61C-3-V4_S78_L001_R2_001.fastq \ ##These two lines above merge the forward and reverse reads of the DNA sequences within the individual files into one file
-S 61C-3.sam ##This line renames the merged and edited file and transforms it into a .sam file

Here's what I've got so far for this little step of the process:


for file in /file directory/*.fastq

do

bowtie2 -q --end-to-end --very-sensitive \
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \
-1  /file directory/*.fastq 
-2 /file directory/*.fastq \
-S *.sam

done

In my resulting slurm file, the error I'm getting right now has to do with the -S command. I'm not sure how to give each merged and edited sample a unique name for the .sam file. I'm new to writing for loops in python (my only experience is in R) and I'm sure it's a simple fix, but I haven't been able to find any specific answers to this question.

score 1 · Answer 1 · 2019-08-07T19:33:33.420

Here's a first try. Note I assume the entire fragment between do and done is one command, and therefore needs continuation markers (\).

Also note in my example "$file" occurs twice. I feel a bit uneasy about this, but you seem to explicity need this in your described example.

And finally note I am giving the sam file just a numeric name, because I don't really know what you would like that name to be.

I hope this provides enough information to get you started.

#!/bin/bash
i=0
for file in /file/directory/*.fastq
do
     bowtie2 -q --end-to-end --very-sensitive \
      -N 0 -L 31 --time --reorder \
      -x A1-BlankIndex \
      -1 "$file"  \
      -2 "$file" \
      -S "$i".sam
      i=$((i+1))
done

In his example @KimL. had R1 sample and R2 sample respectively to the `-1` and `-2` options arguments. But your script gives the same `$file` name for both. `bowtie2` needs a forward and a reverse sample pair. I added an answer witch iterates R1 samples and automatically select the corresponding R2 sample, as well as extract the referrence for the output merged sample. — Léa Gris, Aug 07 '19 at 20:10

score 0 · Accepted Answer · answered Aug 07 '19 at 20:02

This may work as your example but automatically select the output file name referrence with a RegEx:

#!/usr/bin/env bash

input_samples='/input_samples_directory'
output_samples='/output_merged_samples_directory'

while IFS= read -r -d '' R1_fastq; do
  # Deduce R2 sample from R1 sample file name
  R2_fastq="${R1_fastq/_R1_/_R2_}"
  # RegEx match capture group in () for the output sample reference
  [[ $R1_fastq =~ [^-]+-([[:digit:]]+[[:alpha:]]-[[:digit:]]).* ]]
  # Construct the output sample file path with the captured referrenced
  # from the RegEx above
  sam="$output_samples/${BASH_REMATCH[1]}.sam"
  # Perform the merging
  bowtie2 -q --end-to-end --very-sensitive \
    -N 0 -L 31 --time --reorder \
    -x A1-BlankIndex \
    -1 "$R1_fastq" \
    -2 "$R2_fastq" \
    -S "$sam"
done < <(find "$input_samples" -maxdepth 1 -type -f -name '*_R1_*.fastq' -print0)

O.K first off I want to extend my gratitude for you taking the time to help me with my question. The regular expression format really helped me figure this out! I know what REs are, but haven't used them before, so thanks a bunch for clarifying this for me. As of now, this format appears to have worked! It's probably going to take a while for the supercomputer to process all of the sequences, but I will keep you updated on my results. Thanks again! — Kim L., Aug 21 '19 at 20:28

How do I write a for-loop so a program reiterates itself for a set of 94 DNA samples?

2 Answers2