I have an existing snakemake RNAseq workflow that works fine with a directory tree as below. I need to alter the workflow so that it can accommodate another layer of directories. Currently, I use a python script that os.walks the parent directory and creates a json file for the sample wildcards (json file for sample widlcards also included below). I am not very familiar with python, and it seems to me that adapting the code for an extra layer of directories shouldn't be too difficult and was hoping someone would be kind enough to point me in the right direction.
RNAseqTutorial/
├── Sample_70160
│ ├── 70160_ATTACTCG-TATAGCCT_S1_L001_R1_001.fastq.gz
│ └── 70160_ATTACTCG-TATAGCCT_S1_L001_R2_001.fastq.gz
├── Sample_70161
│ ├── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R1_001.fastq.gz
│ └── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R2_001.fastq.gz
├── Sample_70162
│ ├── 70162_CGCTCATT-ATAGAGGC_S3_L001_R1_001.fastq.gz
│ └── 70162_CGCTCATT-ATAGAGGC_S3_L001_R2_001.fastq.gz
├── Sample_70166
│ ├── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R1_001.fastq.gz
│ └── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R2_001.fastq.gz
├── scripts
├── groups.txt
└── Snakefile
{
"Sample_70162": {
"R1": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R1.fq.gz"
],
"R2": [ "/gpfs/accounts//SlurmMiKTMC/Sample_70162/Sample_70162.R2.fq.gz"
]
},
{
"Sample_70162": {
"R1": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R1.fq.gz"
],
"R2": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R2.fq.gz"
]
}
}
The structure I need to accommodate is below
RNAseqTutorial/
├── part1
│ ├── 030-150-G
│ │ ├── 030-150-GR1_clipped.fastq.gz
│ │ └── 030-150-GR2_clipped.fastq.gz
│ ├── 030-151-G
│ │ ├── 030-151-GR1_clipped.fastq.gz
│ │ └── 030-151-GR2_clipped.fastq.gz
│ ├── 100T
│ │ ├── 100TR1_clipped.fastq.gz
│ │ └── 100TR2_clipped.fastq.gz
├── part2
│ ├── 030-025G
│ │ ├── 030-025GR1_clipped.fastq.gz
│ │ └── 030-025GR2_clipped.fastq.gz
│ ├── 030-131G
│ │ ├── 030-131GR1_clipped.fastq.gz
│ │ └── 030-131GR2_clipped.fastq.gz
│ ├── 030-138G
│ │ ├── 030-138R1_clipped.fastq.gz
│ │ └── 030-138R2_clipped.fastq.gz
├── part3
│ ├── 030-103G
│ │ ├── 030-103GR1_clipped.fastq.gz
│ │ └── 030-103GR2_clipped.fastq.gz
│ ├── 114T
│ │ ├── 114TR1_clipped.fastq.gz
│ │ └── 114TR2_clipped.fastq.gz
├── scripts
├── groups.txt
└── Snakefile
The main script that generates the json file for the sample wildcards is below
for root, dirs, files in os.walk(args):
for file in files:
if file.endswith("fq.gz"):
full_path = join(root, file)
#R1 will be forward reads, R2 will be reverse reads
m = re.search(r"(.+).(R[12]).fq.gz", file)
if m:
sample = m.group(1)
reads = m.group(2)
FILES[sample][reads].append(full_path)
I just can't seem to wrap my head around a way to accommodate that extra layer. Is there another module or function other than os.walk? Could I somehow force os.walk to skip a directory and merge the part and sample prefixes? Any suggestions would be helpful!
Edited to add: I wasn't clear in describing my problem, and noticed that the second example wasn't representative of the problem, and I fixed the examples accordingly, because the second tree was taken from a directory processed by someone else. Data I get comes in two forms, either samples of only one tissue, where the directory consists of WD, sampled folders, and fastq files, where the fastq files have the same prefix as the sample folders that they reside in. The second example is of samples from two tissues. These tissues must be processed separate from each other. But tissues from both types can be found in separate "Parts", but tissues of the same type from different "Parts" must be processed together. If I could get os.walk to return four tuples, or even use
root,dirs,files*=os.walk('Somedirectory')
where the * would append the rest of the directory string to the files variable. Unfortunately, this method does not go to the file level for the third child directory 'root/part/sample/fastq'. In an ideal world, the same snakemake pipeline would be able to handle both scenarios with minimal input from the user. I understand that this may not be possible, but I figured I ask and see if there was a module that could return all portions of each sample directory string.