I am running a Snakemake pipeline on a Slurm HPC cluster. Occasionally, jobs fail because they exceed their wall time or memory limit. Such failed jobs either do not create log files, or their log files are deleted as part of Snakemake's automatic removal of files associated with failed jobs. It would be convenient to keep the logging information for failed jobs so that I could more easily understand why each job failed.
I currently have a log directive set for each rule, and the cluster.json file then references those logs for each job. An example rule, the relevant part of cluster.json, and my snakemake call are shown below.
rule fastqScreen:
    input:
        Fast1="{sample}/{sample}.R1.fq.gz",
        Fast2="{sample}/{sample}.R2.fq.gz"
    output:
        output1="{sample}/{sample}.fq.gz",
        output2="{sample}/{sample}_screen.png",
        output3="{sample}/{sample}_screen.txt"
    log: "logs/{sample}FastScreen.log"
    params:
        outprefix = "{sample}"
    threads: 4
    priority: 3
    shell:
        """
        cat {input.Fast1} {input.Fast2} > {output.output1} && /home/manninm/Programs/fastq_screen_v0.14.0/fastq_screen --aligner bowtie2 --quiet --force --threads {threads} {output.output1}
        """

"__default__": {
    "account": "kretzler",
    "job-name": "17_{rule}",
    "partition": "standard",
    "nodes": "1",
    "time": "10:00:00",
    "ntasks-per-node": "1",
    "cpus-per-task": "1",
    "mem": "4g",
    "output": "{log}.out.txt",
    "error": "{log}.err.txt",
    "mail-user": "$USER@umich.edu",
    "mail-type": "ALL"
},
"HtSeq_Count": {
    "cpus-per-task": "{threads}",
    "--mem": "16g",
    "time": "8:00:00",
    "output": "{log}.out.txt",
    "error": "{log}.error.log"
},
snakemake -j 1000 --restart-times 2 --max-jobs-per-second 5 --max-status-checks-per-second 5 --cluster-config cluster.json --cluster 'sbatch --job-name {cluster.job-name} --nodes {cluster.nodes} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {cluster.cpus-per-task} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
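To make it clearer how I understand the values in the sbatch call to be resolved, here is a minimal sketch in plain Python. It only imitates the lookup and is not Snakemake's own code. It assumes that a rule-specific entry is overlaid on top of `__default__`, and the log path `logs/S1HtSeq.log` is a hypothetical example. The dictionaries are trimmed copies of the entries above.

```python
# Plain-Python imitation of the cluster-config lookup (an assumption about the
# behaviour, not Snakemake's implementation).
default = {
    "mem": "4g",
    "time": "10:00:00",
    "cpus-per-task": "1",
    "error": "{log}.err.txt",
}
htseq_count = {
    "cpus-per-task": "{threads}",
    "--mem": "16g",
    "time": "8:00:00",
    "error": "{log}.error.log",
}


def resolve(default_entry, rule_entry):
    """Overlay a rule-specific entry on top of the __default__ entry."""
    merged = dict(default_entry)
    merged.update(rule_entry)
    return merged


cfg = resolve(default, htseq_count)

# {cluster.mem} reads the key "mem"; the key "--mem" is a different key.
print(cfg["mem"])   # 4g
print(cfg["time"])  # 8:00:00

# Hypothetical log path, only to show how the error file name is built.
print(cfg["error"].format(log="logs/S1HtSeq.log"))  # logs/S1HtSeq.log.error.log
```

If this sketch matches what Snakemake does, then HtSeq_Count jobs would still be submitted with `--mem 4g`, because the sbatch call uses `{cluster.mem}` and the rule entry only defines `"--mem"`.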
I would like the error, or the reason a job failed, to be written to the error log file associated with each job, if at all possible. I don't understand what I am doing wrong that causes the log files for failed jobs to disappear.
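For reference, the kind of information I am after is what Slurm's accounting reports. The following is a rough sketch, assuming job accounting is enabled on the cluster and `sacct` is on the PATH. The function names and the sample output are hypothetical and only illustrate the idea; this is not something I have wired into the pipeline.

```python
import subprocess

# sacct fields to request; all are standard sacct format fields.
FIELDS = ["JobID", "State", "ExitCode", "Elapsed", "Timelimit", "MaxRSS", "ReqMem"]


def parse_sacct(text):
    """Turn pipe-delimited `sacct --parsable2 --noheader` output into dicts."""
    rows = []
    for line in text.strip().splitlines():
        rows.append(dict(zip(FIELDS, line.split("|"))))
    return rows


def failure_reason(rows):
    """Return the first state that is not completed/running/pending, or None."""
    for row in rows:
        state = row["State"].split()[0]  # "CANCELLED by 1234" -> "CANCELLED"
        if state not in ("COMPLETED", "RUNNING", "PENDING"):
            return state
    return None


def query_job(job_id):
    """Ask Slurm accounting about one job (needs sacct; not called below)."""
    cmd = ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
           "--format=" + ",".join(FIELDS)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_sacct(out)


# Hypothetical sacct output for a job that hit its wall-time limit.
sample = """12345|TIMEOUT|0:0|10:00:12|10:00:00||4G
12345.batch|CANCELLED|0:15|10:00:13||3912K|"""

rows = parse_sacct(sample)
print(failure_reason(rows))  # TIMEOUT
```

Ideally a state such as TIMEOUT or OUT_OF_MEMORY would end up in the per-job error log rather than having to be looked up by job ID afterwards.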