I'm trying to find the most elegant solution, using snakemake, to move and rename ~1000 fastq files that are stored in around 50 separate folders. My original attempt was storing the file location and new sample ID data in the config file using:

CONFIG

samples: 
    15533_Oct_2014/15533_L7_R1_001.fastq.gz: 15533_Extr_L7_R1.fastq.gz
    15533_Oct_2014/15533_L7_R2_001.fastq.gz: 15533_Extr_L7_R2.fastq.gz
    16826_Jan_2015/16826_L8_R1_001.fastq: 16826_Extr_L8_R1.fastq
    16826_Jan_2015/16826_L8_R2_001.fastq: 16826_Extr_L8_R2.fastq

SNAKEFILE

rule all:
    input:
       expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input:
    output: "fastqs/{sample}"
    shell:
        """echo mv {input} {output}"""

Running snakemake -np produces the shell commands without error. It correctly creates 4 instances of the rule and populates {output} with an individual filename (i.e. the new filename specified to the right of the colon in the config file).

My issue is that I'm not sure how to populate the {input} section of the shell command with the file location (i.e. the corresponding location stored to the left of the colon in the config file). When I try to access these locations with various lambda wildcards: functions, I get errors.

Incidentally, this post suggests an alternative, and perhaps more elegant, method to tackle this by storing the file locations/new names in a .tsv file. However, it does not explain how to access information in the .tsv file within the rules.

I have made an attempt at a Snakefile for this, but it is unclear to me how to reference the information stored in sampleID and fastq, either in rule move_and_rename_fastqs or in rule all. Although snakemake -np produces output here, it is obviously gobbledygook: {input} is populated with all the files assigned to fastq, and because I'm referencing two sources for the sample information (the config file in rule all, sample_file in rule move_and_rename_fastqs), the sample IDs populating the {input} and {output} sections don't match as they should.

Any guidance with regard to the most elegant solution to get round this issue would be greatly appreciated.

SNAKEFILE 2

import pandas as pd

configfile: "config.yaml"
sample_file = config["sample_file"]

sampleID = pd.read_table(sample_file)['sampleID']
fastq = pd.read_table(sample_file)['fastq']


rule all:
    input:
       expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input: fastq = lambda wildcards: fastq
    output: "fastqs/{sample}"
    shell:
        """echo mv {input.fastq} {output}"""

sample_file

fastq   sampleID
15533_Oct_2014/15533_L7_R1_001.fastq.gz 15533_Extr_L7_R1.fastq.gz
15533_Oct_2014/15533_L7_R2_001.fastq.gz 15533_Extr_L7_R2.fastq.gz

RESPONSE to UNFUN CAT

import pandas as pd

configfile: "config.yaml"
sample_file = config["sample_file"]

sampleID = pd.read_table(sample_file)['sampleID']
fastq = pd.read_table(sample_file)['fastq']
df = pd.read_table(sample_file)


rule all:
    input:
       expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input:  fastq = lambda w: df[df.sampleID == w.sample].File.tolist()
    output: "fastqs/{sample}"
    shell:
            """echo mv {input.fastq} {output}"""
Darren

1 Answer

import pandas as pd

configfile: "config.yaml"
sample_file = config["sample_file"]

# Read the sample sheet once; columns: fastq (current location), sampleID (new name)
df = pd.read_table(sample_file)


rule all:
    input:
        # Take the targets from the same sheet that the input function uses
        expand("fastqs/{sample}", sample=df.sampleID)

rule move_and_rename_fastqs:
    # Look up the original location of the file whose new name is {sample}
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "fastqs/{sample}"
    shell:
        """echo mv {input.fastq} {output}"""

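The input function above is only a lookup from the new name (the value of the `sample` wildcard) back to the original location. As a rough sketch, the same lookup can be done with the standard library alone. This is not the pandas approach used in this answer. It assumes a tab-separated sheet with the `fastq` and `sampleID` columns from the question, and `load_sample_sheet`, `NEW_TO_OLD` and `source_for` are hypothetical names, not part of Snakemake or pandas:

```python
import csv
from io import StringIO

# Stand-in for the sample sheet; a real Snakefile would open the file instead.
SHEET = StringIO(
    "fastq\tsampleID\n"
    "15533_Oct_2014/15533_L7_R1_001.fastq.gz\t15533_Extr_L7_R1.fastq.gz\n"
    "15533_Oct_2014/15533_L7_R2_001.fastq.gz\t15533_Extr_L7_R2.fastq.gz\n"
)


def load_sample_sheet(handle):
    """Return a dict mapping each new name (sampleID) to its original path (fastq)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sampleID"]: row["fastq"] for row in reader}


NEW_TO_OLD = load_sample_sheet(SHEET)


def source_for(sample):
    """Return the original location for the new name held in the wildcard."""
    return NEW_TO_OLD[sample]


print(source_for("15533_Extr_L7_R1.fastq.gz"))
```

In a Snakefile the rule could then use `input: lambda w: source_for(w.sample)`, and `rule all` could expand over `list(NEW_TO_OLD)`, so both rules draw on the same table.
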
Edit: Version that works without any config-files:

import pandas as pd

from io import StringIO

sample_file = StringIO("""fastq   sampleID
15533_Oct_2014/15533_L7_R1_001.fastq.gz 15533_Extr_L7_R1.fastq.gz
15533_Oct_2014/15533_L7_R2_001.fastq.gz 15533_Extr_L7_R2.fastq.gz""")

df = pd.read_table(sample_file, sep=r"\s+", header=0)
sampleID = df.sampleID
fastq = df.fastq

rule all:
    input:
       expand("fastqs/{sample}", sample=df.sampleID)

rule move_and_rename_fastqs:
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "fastqs/{sample}"
    shell:
            """echo mv {input.fastq} {output}"""

Gives:

snakemake -np
Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    2   move_and_rename_fastqs
    3

[Mon Jun 29 15:57:30 2020]
rule move_and_rename_fastqs:
    input: 15533_Oct_2014/15533_L7_R2_001.fastq.gz
    output: fastqs/15533_Extr_L7_R2.fastq.gz
    jobid: 2
    wildcards: sample=15533_Extr_L7_R2.fastq.gz

echo mv 15533_Oct_2014/15533_L7_R2_001.fastq.gz fastqs/15533_Extr_L7_R2.fastq.gz

[Mon Jun 29 15:57:30 2020]
rule move_and_rename_fastqs:
    input: 15533_Oct_2014/15533_L7_R1_001.fastq.gz
    output: fastqs/15533_Extr_L7_R1.fastq.gz
    jobid: 1
    wildcards: sample=15533_Extr_L7_R1.fastq.gz

echo mv 15533_Oct_2014/15533_L7_R1_001.fastq.gz fastqs/15533_Extr_L7_R1.fastq.gz

[Mon Jun 29 15:57:30 2020]
localrule all:
    input: fastqs/15533_Extr_L7_R1.fastq.gz, fastqs/15533_Extr_L7_R2.fastq.gz
    jobid: 0

Job counts:
    count   jobs
    1   all
    2   move_and_rename_fastqs
    3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
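
The config-file layout from the question can probably be made to work too. There the mapping runs from location to new name, while the wildcard carries the new name, so the dictionary has to be inverted before an input function can use it. A minimal sketch, assuming `samples` stands in for `config['samples']`; `invert_unique` is a hypothetical helper that also rejects two locations sharing one new name, which is easy to miss with ~1000 files:

```python
# Stand-in for config["samples"] from the question: original location -> new name.
samples = {
    "15533_Oct_2014/15533_L7_R1_001.fastq.gz": "15533_Extr_L7_R1.fastq.gz",
    "15533_Oct_2014/15533_L7_R2_001.fastq.gz": "15533_Extr_L7_R2.fastq.gz",
    "16826_Jan_2015/16826_L8_R1_001.fastq": "16826_Extr_L8_R1.fastq",
    "16826_Jan_2015/16826_L8_R2_001.fastq": "16826_Extr_L8_R2.fastq",
}


def invert_unique(mapping):
    """Invert a location -> new-name mapping, refusing duplicate new names."""
    inverted = {}
    for old, new in mapping.items():
        if new in inverted:
            raise ValueError(
                f"{new!r} is assigned to both {inverted[new]!r} and {old!r}"
            )
        inverted[new] = old
    return inverted


new_to_old = invert_unique(samples)
print(new_to_old["16826_Extr_L8_R1.fastq"])
```

The rule's input would then be `lambda wildcards: new_to_old[wildcards.sample]`, with `rule all` expanding over `list(new_to_old)`.
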
The Unfun Cat
  • Thanks for the suggestion. Unfortunately, it throws the following error: `InputFunctionException in line 14 of /nfshome/store03/users/c.c1477909/sample_info/Snakefile: AttributeError: 'Series' object has no attribute 'sampleID' Wildcards: sample=15533_Extr_L7_R1.fastq.gz`. As I'm not sure exactly what your code does, I'm unsure how to get round this. – Darren Jun 25 '20 at 16:23
  • Ah, you need to read the sample sheet into a dataframe, then it will work. `df = pd.read_table(sample_file)`. Then replace fastq with df in the lambda. – The Unfun Cat Jun 26 '20 at 07:30
  • Thanks again. Unfortunately that doesn't work either. I get an input function error: AttributeError: 'DataFrame' object has no attribute 'File' Wildcards: sample=15533_Extr_L7_R1.fastq.gz. I'm a bit confused as to why the `sampleID` is referenced in the input when I'm trying to have the fastq file locations (list under the `fastq` heading in the df) in the input. I need the new names (listed under `sampleID` heading in the df) in the output. The rule all is also still referencing the config file rather than the `sample_file` df, so your example is using two different sources for the same info. – Darren Jun 26 '20 at 12:00
  • That resolves without errors, but there are no file locations fed to the input portion of the shell command. So it's just the same output as I initially had with my script. The shell command looks like this: `echo mv fastqs/17071_Extr_L3_R1.fastq` I need the file locations after the `mv`. – Darren Jun 26 '20 at 15:06
  • You forgot to update the lambda. Yours still says File, not fastq :) – The Unfun Cat Jun 29 '20 at 13:41