Altering snakemake workflow to anticipate and accommodate different data-structures

Question

I have an existing snakemake RNAseq workflow that works fine with a directory tree as below. I need to alter the workflow so that it can accommodate another layer of directories. Currently, I use a python script that os.walks the parent directory and creates a json file for the sample wildcards (json file for sample widlcards also included below). I am not very familiar with python, and it seems to me that adapting the code for an extra layer of directories shouldn't be too difficult and was hoping someone would be kind enough to point me in the right direction.

RNAseqTutorial/
├── Sample_70160
│   ├── 70160_ATTACTCG-TATAGCCT_S1_L001_R1_001.fastq.gz
│   └── 70160_ATTACTCG-TATAGCCT_S1_L001_R2_001.fastq.gz
├── Sample_70161
│   ├── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R1_001.fastq.gz
│   └── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R2_001.fastq.gz
├── Sample_70162
│   ├── 70162_CGCTCATT-ATAGAGGC_S3_L001_R1_001.fastq.gz
│   └── 70162_CGCTCATT-ATAGAGGC_S3_L001_R2_001.fastq.gz
├── Sample_70166
│   ├── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R1_001.fastq.gz
│   └── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R2_001.fastq.gz
├── scripts
├── groups.txt
└── Snakefile

{
    "Sample_70162": {
        "R1": [ "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R1.fq.gz"
        ],
        "R2": [  "/gpfs/accounts//SlurmMiKTMC/Sample_70162/Sample_70162.R2.fq.gz"
        ]
    },
{
    "Sample_70162": {
        "R1": [          "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R1.fq.gz"
        ],
        "R2": [     "/gpfs/accounts/SlurmMiKTMC/Sample_70162/Sample_70162.R2.fq.gz"
        ]
    }
}

The structure I need to accommodate is below

RNAseqTutorial/
├── part1
│   ├── 030-150-G
│   │   ├── 030-150-GR1_clipped.fastq.gz
│   │   └── 030-150-GR2_clipped.fastq.gz
│   ├── 030-151-G
│   │   ├── 030-151-GR1_clipped.fastq.gz
│   │   └── 030-151-GR2_clipped.fastq.gz
│   ├── 100T
│   │   ├── 100TR1_clipped.fastq.gz
│   │   └── 100TR2_clipped.fastq.gz
├── part2
│   ├── 030-025G
│   │   ├── 030-025GR1_clipped.fastq.gz
│   │   └── 030-025GR2_clipped.fastq.gz
│   ├── 030-131G
│   │   ├── 030-131GR1_clipped.fastq.gz
│   │   └── 030-131GR2_clipped.fastq.gz
│   ├── 030-138G
│   │   ├── 030-138R1_clipped.fastq.gz
│   │   └── 030-138R2_clipped.fastq.gz
├── part3
│   ├── 030-103G
│   │   ├── 030-103GR1_clipped.fastq.gz
│   │   └── 030-103GR2_clipped.fastq.gz
│   ├── 114T
│   │   ├── 114TR1_clipped.fastq.gz
│   │   └── 114TR2_clipped.fastq.gz
├── scripts
├── groups.txt
└── Snakefile

The main script that generates the json file for the sample wildcards is below

    for root, dirs, files in os.walk(args):
        for file in files:
            if file.endswith("fq.gz"):
                full_path = join(root, file)
                #R1 will be forward reads, R2 will be reverse reads
                m = re.search(r"(.+).(R[12]).fq.gz", file)
                if m:
                    sample = m.group(1)
                    reads = m.group(2)  
                    FILES[sample][reads].append(full_path)

I just can't seem to wrap my head around a way to accommodate that extra layer. Is there another module or function other than os.walk? Could I somehow force os.walk to skip a directory and merge the part and sample prefixes? Any suggestions would be helpful!

Edited to add: I wasn't clear in describing my problem, and noticed that the second example wasn't representative of the problem, and I fixed the examples accordingly, because the second tree was taken from a directory processed by someone else. Data I get comes in two forms, either samples of only one tissue, where the directory consists of WD, sampled folders, and fastq files, where the fastq files have the same prefix as the sample folders that they reside in. The second example is of samples from two tissues. These tissues must be processed separate from each other. But tissues from both types can be found in separate "Parts", but tissues of the same type from different "Parts" must be processed together. If I could get os.walk to return four tuples, or even use

root,dirs,files*=os.walk('Somedirectory')

where the * would append the rest of the directory string to the files variable. Unfortunately, this method does not go to the file level for the third child directory 'root/part/sample/fastq'. In an ideal world, the same snakemake pipeline would be able to handle both scenarios with minimal input from the user. I understand that this may not be possible, but I figured I ask and see if there was a module that could return all portions of each sample directory string.

score 0 · Answer 1 · answered Oct 16 '19 at 13:46

It seems to me that your problem doesn't have much to do on how to accommodate the second layer. Instead, the question is about the specifications of the directory trees and file names you expect.

In the first case, it seems you can extract the sample name from the first part of the file name. In the second case, file names are all the same and the sample name comes from the parent directory. So, either you implement some logic that tells which naming scheme you are parsing (and this depends on who/what provides the files) or you always extract the sample name from the parent directory as this should work also for the first case (but again, assuming you can rely on such naming scheme).

If you want to go for the second option, something like this should do:

FILES = {}
for root, dirs, files in os.walk('RNAseqTutorial'):
    for file in files:
        if file.endswith("fastq.gz"):
            sample = os.path.basename(root)
            full_path = os.path.join(root, file)
            if sample not in FILES:
                FILES[sample]= {}
            if 'R1' in file:
                reads = 'R1'
            elif 'R2' in file:
                reads = 'R2'
            else:
                raise Exception('Unexpected file name')
            if reads not in FILES[sample]:
                FILES[sample][reads] = []
            FILES[sample][reads].append(full_path)

score 0 · Answer 2 · answered Oct 16 '19 at 13:47

Not sure if I understand correctly, but here you go:

for root, dirs, files in os.walk(args):
    for file in files:
        if file.endswith("fq.gz"):
            full_path = join(root, file)
            reads = 'R1' if 'R1' in file else 'R2'
            sample = root.split('/')[-1]
            FILES[sample][reads].append(full_path)

Altering snakemake workflow to anticipate and accommodate different data-structures

2 Answers2