# md5sum on fastq folder on cluster
rule md5sum_fastq_cluster:
    input:
        path_cluster+'/'+project_name+'/'+project_name+'.csv'
    output:
        path_cluster+'/'+project_name+'/'+'md5sum.txt'
    shell:
        """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( ".*/", "", $2 )}}' | sort > {output}"""


# md5sum on fastq folder on remote server
rule md5sum_fastq_SAN:
    input:
        copyFASTQdone
    output:
        SFTPsan.remote(server_san+path_san+'/'+project_name+'/md5sum.txt')
    shell:
        """ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( ".*/", "", \$2 )}}' | sort" > {output}"""

--------------------------------------------------------------------------
awk: cmd. line:1: {print $1, gensub( .*/, , $2 )}
awk: cmd. line:1:                    ^ syntax error
awk: cmd. line:1: {print $1, gensub( .*/, , $2 )}

Obviously my gensub syntax is wrong: the error output shows that the double quotes around its arguments are gone by the time awk sees them.
Before I added the gensub call, the shell commands of my two rules were:

"""find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1}}' | sort > {output}"""

"""ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1}}' | sort" > {output}"""

Those worked. Since adding gensub, I can't find the right syntax.
I need gensub to do what basename does: strip the directory path from each file name.
And of course, the awk/gensub command works when I run it outside Snakemake.

Just in case, here are the files produced by my rules:

# md5sum.txt before gensub
01afd3f2bf06d18c5609b2c2c963eddf /data/imrb/Data/200122_GSC/14-CTRL50TMZ1907192_S11_R2_001.fastq.gz
03e353c316aef09c748aa2363db95599 /data/imrb/Data/200122_GSC/15-11650TMZ1907192_S12_R2_001.fastq.gz
1ba21b8be882bcb62c464ba515800ca4 /data/imrb/Data/200122_GSC/1-CTRL120719_S1_R2_001.fastq.gz

# md5sum.txt after gensub
01afd3f2bf06d18c5609b2c2c963eddf 14-CTRL50TMZ1907192_S11_R2_001.fastq.gz
03e353c316aef09c748aa2363db95599 15-11650TMZ1907192_S12_R2_001.fastq.gz
1ba21b8be882bcb62c464ba515800ca4 1-CTRL120719_S1_R2_001.fastq.gz
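
For reference, the transformation that `gensub( ".*/", "", $2 )` performs on each line (note that `gensub` is a GNU awk extension) can be sketched in plain Python. This is only an illustration of the intended before/after mapping, not part of the workflow; the helper name `strip_dirs` is made up for the example.

```python
import os.path


def strip_dirs(md5sum_lines):
    """Return sorted 'checksum basename' lines from md5sum-style output."""
    out = []
    for line in md5sum_lines:
        # md5sum prints "<checksum>  <path>"; split once on whitespace
        checksum, path = line.split(maxsplit=1)
        out.append(f"{checksum} {os.path.basename(path.strip())}")
    return sorted(out)


sample = [
    "03e353c316aef09c748aa2363db95599  /data/imrb/Data/200122_GSC/15-11650TMZ1907192_S12_R2_001.fastq.gz",
    "01afd3f2bf06d18c5609b2c2c963eddf  /data/imrb/Data/200122_GSC/14-CTRL50TMZ1907192_S11_R2_001.fastq.gz",
]
for line in strip_dirs(sample):
    print(line)
```

Unlike the awk version, which splits on whitespace, this sketch would also keep file names containing spaces intact.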
Elysire
  • Start by making `ssh imrb@{server_san} "echo 7 | awk '{print \$1, gensub( ".*/", "", \$2 )}'"` or similar work and THEN worry about including that in your much larger script. I assume you have some reason for doubling every `{` and `}` in your full script. – Ed Morton Jul 09 '20 at 16:14
  • It works as long as I don't include it in my Snakemake file. In Snakemake, `{}` inside a shell command refers to elements of the rule (input, output...) or to wildcards. You have to double `{` and `}` if you want Snakemake to pass them to the shell as literal characters. – Elysire Jul 17 '20 at 11:23
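
To illustrate the brace doubling described in the last comment: Snakemake's shell formatting behaves much like Python's `str.format`, so plain `str.format` is enough to show the effect. A minimal sketch; the `path` value is made up for the example.

```python
# Single braces are placeholders; doubled braces come out as literal braces.
template = "find {path} -type f -exec md5sum {{}} + | awk '{{print $1}}'"
rendered = template.format(path="/data/run1")  # hypothetical path
print(rendered)
```

After formatting, `{{}}` has become the `{}` that `find -exec` expects and `{{print $1}}` has become a normal awk block.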

2 Answers

You have double quotes wrapping the command passed to ssh (marked below with ^) so you need to escape the double quotes inside awk. This may work:

"""ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( \".*/\", \"\", \$2 )}}' | sort" > {output}"""
                     ____^____                                                                                                                                 ___^___

(I would also suggest using raw strings for the shell commands to prevent interpretation of metacharacters, i.e. use r""" ... """)
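
As a small illustration of what the `r` prefix changes (plain Python, no Snakemake involved): in a normal string literal `\"` collapses to `"` before anything else sees it, while a raw string keeps the backslash. Brace placeholders are unaffected, so wildcards and `{output}` can still be used in a raw string.

```python
plain = """awk '{{print $1, gensub( \".*/\", \"\", $2 )}}'"""
raw = r"""awk '{{print $1, gensub( \".*/\", \"\", $2 )}}'"""

print(plain)  # backslashes consumed by Python
print(raw)    # backslashes preserved for the shell

# The r prefix only affects backslash handling; brace formatting still applies.
print(raw.format())
```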

dariober
  • I have 2 rules: in the first, the shell command runs locally; in the second, it is passed to SSH. The gensub inside the shell command works in neither. Your syntax doesn't work for the SSH rule, but it does work for the first one once the backslashes before `$1` and `$2` are removed. Here is the solution for the first rule: `"""find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( \".*/\", \"\", $2 )}}' | sort > {output}"""` So now I still have to find the syntax for the command passed to SSH. – Elysire Jul 17 '20 at 11:16
  • Can you explain what you mean by 'using raw strings'? I have to use wildcards, so I can't just hard-code paths in my commands. – Elysire Jul 17 '20 at 11:31

Thanks to dariober I found the right syntax for each rule.

For the first rule, I need to escape the double quotes used inside the awk program:

rule md5sum_fastq_cluster:
    input:
        path_cluster+'/'+project_name+'/'+project_name+'.csv'
    output:
        path_cluster+'/'+project_name+'/'+'md5sum.txt'
    shell:
        """find {path_cluster}/{project_name} -type f -name "*.fastq.gz" -exec md5sum {{}} + | awk '{{print $1, gensub( \".*/\", \"\", $2 )}}' | sort > {output}"""

For the second rule, the shell command is passed to SSH, so I need to double-escape the double quotes and put a \ before each $:

rule md5sum_fastq_SAN:
    input:
        copyFASTQdone
    output:
        SFTPsan.remote(server_san+path_san+'/'+project_name+'/md5sum.txt')
    shell:
        """ssh imrb@{server_san} "find {path_san}/{project_name} -type f -name '*.fastq.gz' -exec md5sum {{}} + | awk '{{print \$1, gensub( \\".*/\\", \\"\\", \$2 )}}' | sort" > {output}"""
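
Why the extra level of escaping helps can be checked without a remote host. The sketch below uses `sh -c "..."` as a stand-in for `ssh host "..."` (an assumption made for the demo: in both cases the double-quoted text is processed by the local shell first and then handed to a second shell), and `printf` instead of awk so that it runs anywhere. The raw string shows the command as the local shell sees it, that is, after Python has turned each `\\"` into `\"`.

```python
import subprocess

# Text as received by the local shell. Inside the outer double quotes,
# \" becomes a literal " and \$ becomes a literal $ for the inner shell.
local_cmd = r'''sh -c "printf '%s\n' 'gensub( \".*/\", \"\", \$2 )'"'''

result = subprocess.run(local_cmd, shell=True, capture_output=True, text=True)
inner_text = result.stdout.strip()
print(inner_text)
```

The printed text is what the inner shell passed along as a single-quoted argument: the double quotes and `$2` arrive intact, which is what awk needs on the remote side.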
Elysire