I'm new to snakemake and running into some behavior I don't understand. I have a set of fastq files with file names following the standard Illumina convention:

SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz

These files are in the directory reads/raw_fastq. I'd like to create symbolic links that simplify the names to the pattern:

SAMPLENAME_READ.fastq.gz

The links should go in the directory reads/renamed_raw_fastq.

My aim is that as I add new fastq files to the project, snakemake will create symlinks only for the newly-added files.

My snakefile is as follows:

# Get sample names from read file names in the "raw" directory

readRootDir = 'reads/'
readRawDir = readRootDir + 'raw_fastq/'

import os

samples = list(set([x.split('_', 1)[0] for x in os.listdir(readRawDir)]))
samples.sort()

# Generate simplified names

readRenamedRawDir = readRootDir + 'renamed_raw_fastq/'

newNames = expand(readRenamedRawDir + "{sample}_{read}.fastq.gz", sample = samples, read = ["R1", "R2"])

# Create symlinks

import glob

def getRawName(wildcards):
    rawName = glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0]
    return rawName

rule all:
    input: newNames 

rule rename:
    input: getRawName
    output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
    shell: "ln -sf {input} {output}"

When I run snakemake, it tries to generate the symlinks as expected but:

  1. Always tries to create the target symlinks, even when they already exist and have later timestamps than the source fastq files.

  2. Throws errors like:

MissingOutputException in line 68 of /work/nick/FAW-MIPs/renameRaw.snakefile:
Missing files after 5 seconds:
reads/renamed_raw_fastq/Ben21_R2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

It's almost as if snakemake isn't seeing the output files it creates. Can anyone suggest what I might be missing here?

Thanks!

  • Try running the workflow with the **--printshellcmds** flag, which prints the exact shell commands that Snakemake executes for each rule. Then run the same commands manually: do the symbolic links exist afterwards? Another hint: try copying the files instead of creating symbolic links. Does that work? – Dmitry Kuzminov Sep 10 '19 at 21:15
  • The `-r` flag gives the reason why Snakemake chose to recreate the files :) – The Unfun Cat Sep 11 '19 at 11:05
  • Thanks for these tips on flags for more informative output - much appreciated. – Nick Miller Sep 11 '19 at 13:39

1 Answer

I think

ln -sf {input} {output}

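gives a dangling symlink, i.e., one that doesn't point to the source file. A relative link target is resolved relative to the directory that contains the link, not the directory snakemake runs in, so a link in reads/renamed_raw_fastq/ whose target starts with reads/raw_fastq/ resolves to a path under reads/renamed_raw_fastq/reads/raw_fastq/, which does not exist. Snakemake then treats the output as missing and tries to recreate it on every run. You could fix it by e.g. using absolute paths, like: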

def getRawName(wildcards):
    rawName = os.path.abspath(glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0])
    return rawName
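To see the difference outside snakemake, here is a minimal sketch in plain Python that reproduces the directory layout from the question in a temporary directory. The raw file name (Ben21_S1_L001_R2_001.fastq.gz) and the helper names `make_link` and `demo` are made up for illustration; `os.symlink` stands in for `ln -s`. It also tries a third option that is not in the answer above: a target computed relative to the link's own directory with `os.path.relpath`.

```python
import os
import tempfile


def make_link(src, dst, mode):
    """Create a symlink at dst whose stored target depends on mode."""
    if mode == "as_given":
        target = src  # what `ln -sf {input} {output}` stores
    elif mode == "absolute":
        target = os.path.abspath(src)
    elif mode == "relative_to_link":
        target = os.path.relpath(src, os.path.dirname(dst))
    else:
        raise ValueError(mode)
    if os.path.lexists(dst):  # lexists is also true for dangling links
        os.remove(dst)
    os.symlink(target, dst)


def demo():
    """Return whether the link resolves for each of the three modes."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            os.makedirs("reads/raw_fastq")
            os.makedirs("reads/renamed_raw_fastq")
            # Hypothetical raw file name following the Illumina convention.
            src = "reads/raw_fastq/Ben21_S1_L001_R2_001.fastq.gz"
            open(src, "w").close()
            dst = "reads/renamed_raw_fastq/Ben21_R2.fastq.gz"
            results = []
            for mode in ("as_given", "absolute", "relative_to_link"):
                make_link(src, dst, mode)
                # os.path.exists follows the link, so it is False when dangling.
                results.append(os.path.exists(dst))
            return tuple(results)
        finally:
            os.chdir(old_cwd)


if __name__ == "__main__":
    print(demo())
```

The "as_given" link is created without error but does not resolve, which matches the MissingOutputException in the question. Absolute targets are the simplest fix; targets relative to the link's directory keep working if the whole project directory is moved.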

(As an aside, I would make sure that renaming fastq files the way you do doesn't result in a name-collision, for example when the same sample is sequenced on different lanes of the same flow cell.)
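A quick way to check for such collisions before linking is sketched below. It assumes the file names follow the SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz pattern from the question and splits them the same way the snakefile does; `find_collisions` and the example file names are hypothetical.

```python
from collections import defaultdict


def find_collisions(raw_names):
    """Map each simplified name to the raw names that would share it."""
    groups = defaultdict(list)
    for name in raw_names:
        parts = name.split("_")
        # Skip anything that does not look like an Illumina fastq name.
        if len(parts) < 5 or not name.endswith("_001.fastq.gz"):
            continue
        # First field is the sample name, second-to-last is the read (R1/R2).
        simplified = f"{parts[0]}_{parts[-2]}.fastq.gz"
        groups[simplified].append(name)
    return {new: sorted(olds) for new, olds in groups.items() if len(olds) > 1}


if __name__ == "__main__":
    example = [
        "Ben21_S1_L001_R1_001.fastq.gz",
        "Ben21_S1_L002_R1_001.fastq.gz",  # same sample, different lane
        "Ben21_S1_L001_R2_001.fastq.gz",
    ]
    print(find_collisions(example))
```

In the snakefile this could be run on os.listdir(readRawDir). Note that with glob.glob(...)[0] a collision would not raise an error; only the first matching file would be linked.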

dariober
  • I take your point about the name collision. In this case the data are coming from a miniseq, so the lane number is always the same. – Nick Miller Sep 11 '19 at 13:38