4

I have cram(bam) files that I want to split by read group. This requires reading the header and extracting the read group ids.

I have this function which does that in my Snakemake file:

def identify_read_groups(cram_file):
    import subprocess
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.split('\n')[:-1]
    return(read_groups)

I have this rule all:

rule all:
input:
    expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))

And this rule to actually do the split:

rule split_cram_by_rg:
input:
    cram_file='cram/{sample}.bam.cram',
    read_groups=identify_read_groups('cram/{sample}.bam.cram')
output:
    'cram/RG_bams/{sample}.RG{read_groups}.bam'
run:
    import subprocess
    read_groups = open(input.readGroupIDs).readlines()
    read_groups = [str(rg.replace('\n','')) for rg in read_groups]
    for rg in read_groups:
        command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
        subprocess.check_output(command, shell=True)

I get this error when doing a dry run:

[E::hts_open_format] fail to open file 'cram/{sample}.bam.cram'
samtools view: failed to open "cram/{sample}.bam.cram" for reading: No such file or directory
TypeError in line 19 of /gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile:
a bytes-like object is required, not 'str'
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 37, in <module>
  File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 19, in identify_read_groups

{sample} isn't being passed to the function.

How do I solve this problem? I'm open to other approaches if I'm not doing this in a 'snakemake-ic' way.

==============

EDIT 1

Ok, the first set of examples I gave had many many issues.

Here's a better (?) set of code, which I hope demonstrates my issue.

import sys
from os.path import join

shell.prefix("set -eo pipefail; ")

def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for i in SAMPLES:
    RG_dict[i] = identify_read_groups(i)

rule all:
    input:
        expand('{sample}.boo.txt', sample=list(RG_dict.keys()))

rule split_cram_by_rg:
    input:
        file='cram/{sample}.bam.cram',
        RG = lambda wildcards: RG_dict[wildcards.sample]
    output:
        expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam') # I have a problem HERE. How can I get my read groups values applied here? I need to go from one cram to multiple bam files split by RG (see -r in samtools view below). It can't pull the RG from the input.
    shell:
        'samtools view -b -r {input.RG} {input.file} > {output}'


rule merge_RG_bams_into_one_bam:
    input:
        rules.split_cram_by_rg.output
    output:
        '{sample}.boo.txt'
    message:
        'echo {input}'
    shell:
        'samtools merge {input} > {output}' #not working
        """

==============

EDIT 2

Getting MUCH closer, but currently struggling with expand properly building the lane bam files and keeping the wildcards

I'm using this loop to create the intermediate file names:

for sample in SAMPLES:
    for rg_id in list(return_ID(sample)):
        out_rg_bam.append("temp/lane_bam/{}.ID{}.bam".format(sample, rg_id))

return_ID is a function which takes the sample wildcard and returns a list of the read groups the sample contains

If I use out_rg_bam as an input for a merge rule, then ALL of the files get combined into a merged bam, instead of being split by sample.

If I use expand('temp/realigned/{{sample}}.ID{rg_id}.realigned.bam', sample=SAMPLES, rg_id = return_ID(sample)) then rg_id gets applied to each sample. So if I have two samples (a,b) , with read groups (0,1) and (0,1,2), I end up with a0, a1, a0, a1, a2 and b0, b1, b0, b1, b2.

David M
  • 149
  • 1
  • 9
  • `expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam')` requires information about how to expand `input_RG`. Are these supposed to be the values in `RG_dict` or something like that? – bli Oct 16 '17 at 10:41
  • Also note that if the output is determined based on and expand over a wildcard, this wildcards will not be available inside the rule. If you want to apply `samtools view` once for each read group, then the expand does not belong to the output of the `split_cram_by_rg` rule, but to a rule downstream. Then the read group wildcard will be available in the rules upstream the one where the expand happens. – bli Oct 16 '17 at 10:47
  • You are absolutey correct. I'm still fiddling with getting my example working. I'm much closer and will post when I get it functioning. Right now I'm struggling with building the expand properly. Different bam files have different numbers of read groups, so I can't use the simple expand example I've seen. – David M Oct 24 '17 at 16:05
  • My advice for the expand: Try "by hand" with python looping logic instead of using the snakemake-provided `expand`. – bli Oct 24 '17 at 16:10
  • added edit 2 to explain my current issue – David M Oct 24 '17 at 16:18
  • OK, I think I cracked (solved) it. I ended up having to write a function to create the expanded input files for the merge step. I'm running it now and will post the code should it actually run properly. – David M Oct 24 '17 at 20:06

3 Answers3

3

I'm going to give a more general answer to help others that might find this thread. Snakemake only applies wildcards to strings in the 'input' and 'output' sections when the strings are directly listed, e.g.:

input:
    '{sample}.bam'

If you are trying to use functions like you were here:

input:
    read_groups=identify_read_groups('cram/{sample}.bam.cram')

The wildcard replacement will not be done. You can use a lambda function and do the replacement yourself:

input:
    read_groups=lambda wildcards: identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))

Edit Feb 28, 2023: Of course now python has f-strings, which improves the syntax:

input:
    read_groups=lambda wildcards: identify_read_groups(f'cram/{wildcards.sample}.bam.cram')
Colin
  • 10,447
  • 11
  • 46
  • 54
0

try this: I use id = 0, 1, 2, 3 to name the output bam file depending on how many readgroup for a bam file.

## this is a regular function which takes the cram file, and get the read-group to 
## construct your rule all
## you actually just need the number of @RG, below can be simplified  
def get_read_groups(sample):
    import subprocess
    cram_file = 'cram/' + sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for sample in SAMPLES:
    RG_dict[sample] = get_read_groups(sample)

outbam = []
for sample in SAMPLES:
    read_groups = RG_dict[sample]
    for i in range(len(read_groups)):
        outbam.append("{}.RG{}.bam".format(sample, id))


rule all:
    input:
        outbam

## this is the input function, only uses wildcards as argument 
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards.sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups[wildcards.id])

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups
    output:
        'cram/RG_bams/{sample}.RG{id}.bam'  
    run:
        import subprocess
        read_groups = input.read_groups
        for rg in read_groups:
            command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
            subprocess.check_output(command, shell=True)

when use snakemake, think the way bottom up. First define what you want to generate in the rule all, and then construct the rule to create your final all.

crazyhottommy
  • 310
  • 1
  • 3
  • 8
  • Thanks. I don't have time to check this today, but the `outbam.append("{}.RG{}.bam".format(sample, id))` is useful. The actual output is still one file, but I hopefully can build the logic. – David M Oct 14 '17 at 11:33
  • "when use snakemake, think the way bottom up": I think this is a really important advice. Basically, the input of the `all` rule tells snakemake the real names of all the files you want (no wildcards here, possibly by having wildcards replaced by their possible real values through `expand` or other constructions). Then snakemake tries to match these names with the output of upstream rules: This matching process determines the values of the wildcards in a given instance of the matching rule. – bli Oct 16 '17 at 11:04
-1

Your all rule cannot have wildcards. It's a no wildcard-zone.

EDIT 1

I typed this pseudo-code in Notepad++, its not meant to compile, just trying to provide a framework. I think this is more what you are after.

Use a function inside of an expand to generate a list of file names which will then be used to driver the Snakemake pipeline's all rule. The baseSuffix and basePrefix variables are just to give you an idea as to String passing, arguments are permitted here. When passing back the list of strings, you will have to unpack them to ensure Snakemake reads the result properly.

def getSampleFileList(String basePrefix, String baseSuffix){
    myFileList = []
    ListOfSamples = *The wildcard glob call*
    for sample in ListOfSamples:
        command = "samtools -h " + sample + "SAME CALL USED TO GENERATE LIST OF HEADERS"
        for rg in command:
            myFileList.append(basePrefix + sample + ".RG" + rg + baseSuffix)
}


basePreix = "cram/RG_bams/"
baseSuffix = ".bam" 

rule all:
    input:
        unpack(expand("{fileName}", fileName=getSampleFileList(basePrefix, baseSuffix)))


rule processing_rg_files:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        "Let's pretend this is useful code"

END OF EDIT

If it wasn't in the all rule, you'd use inline functions

So I'm not sure what you're trying to accomplish. As per my guesses, read below for some notes about your code.

rule all:
input:
    expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))

The dry run is failing when it calls the function "identify_read_groups" inside the rule all call. It's being passed into your function call as a string, not a wildcard.

Technically, if the samtools call wasn't failing, and the function call "identify_read_groups(cram_file)" returned a list of 5 strings, it would expand to something like this:

rule all:
    input:
        'cram/RG_bams/{sample}.RG<output1FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output2FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output3FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output4FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output5FromFunctionCall>.bam'

But the term "{sample}", at this stage in Snakemake's pre-processing, is considered a string. As you needed to denote wildcards in an expand function with {{}}.

See how I address every Snakemake variable I declare for my rule all input call and don't use wildcards:

expand("{outputDIR}/{pathGVCFT}/tables/{samples}.{vcfProgram}.{form[1][varType]}{form[1][annotated]}.txt", outputDIR=config["outputDIR"], pathGVCFT=config["vcfGenUtil_varScanDIR"], samples=config["sample"], vcfProgram=config["vcfProgram"], form=read_table(StringIO(config["sampleFORM"]), " ").iterrows())

In this case read_table returns 2-dimensional array to form. Snakemake is well supported by python. I needed this for pairing of different annotations to different variant types.

Your rule all needs to be a string, or list of strings, as input. You cannot have wildcards in your "all" rule. These rule all input strings are what Snakemake uses to generate matches for OTHER wildcards. Build the entire filename in the function call and return it if you need to.

I think you should just turn it into something like this:

rule all:
input:
    expand("{fileName}", fileName=myFunctionCall(BecauseINeededToPass, ACoupleArgs))

Also consider updating this to be more generic.:

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')

It can have two or more wildcards (why we love Snakemake). You can access the wildcards later in the python "run" directive via the wildcards object, since it looks like you'll want to in your for each loop. I think input and output wildcards have to match, so maybe do try it this way as well.

rule split_cram_by_rg:
    input:
        'cram/{sample}.bam.cram'
    output:
        expand('cram/RG_bams/{{sample}}.RG{read_groups}.bam', read_groups=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
    ...
    params:
          rg=myFunctionCall(BecauseINeededToPass, ACoupleArgs)
    run:
        command = 'Just an example ' +  + str(params.rg)

Again, not super sure what you're trying to do, I'm not sure I like the idea of the function call twice, but hey, it would run ;P Also notice the use of a wildcard "sample" in the input directive within a string {} and in the output directive within an expand {{}}.

An example of accessing wildcards in your run directive

Example of function calls in places you wouldn't think. I grabbed VCF fields but it could have been anything. I use an external configfile here.

TBoyarski
  • 441
  • 3
  • 9
  • Thanks, but I'm still confused. My input file is a single cram file. The cram file has multiple read groups (because sequenced across multiple lanes). I want to split the cram file into multiple files, one for each unique read group. I cannot make this work. – David M Oct 13 '17 at 16:47
  • Yea, a lot going on here and I'm not convinced you need any of it. The end result you want is like a sample file name for each respective read group within the cram file? – TBoyarski Oct 13 '17 at 17:27
  • Yes....I think this should be simple, but I keep trying to use a function in an output to generate the proper names and it seems that this is not how you are supposed to do Snakemake. But I can't give the read groups as wild cards since I don't know what the read groups are until I read the input file. – David M Oct 13 '17 at 18:08
  • Just needing a bit more clarity. Where are you getting your list of sample names from? And the read groups for any given sample are determined by reading the header file of that file correct? IE, if the header file doesn't have that read group, don't make that sample specific file for that read group. – TBoyarski Oct 13 '17 at 18:32
  • List of sample names (probably) will come from using a glob on a folder. You are correct about how the read groups work. – David M Oct 13 '17 at 18:52
  • This appears to be similar problem. With ugly answers, unfortunately. There's no elegant way to solve this kind of problem in snakemake? https://groups.google.com/forum/#!searchin/snakemake/dict$20output%7Csort:relevance/snakemake/KohvjZNn2QY/2LNmRUiGUqYJ – David M Oct 13 '17 at 20:08