Nextflow: publishDir, output channels, and output subdirectories

Question

I've been learning Nextflow and have run into an issue with adding output to a channel, which I need because the processes have to run in order. I want to pass output files from one of the output subdirectories created by the tool (ONT Guppy) into a channel, but can't figure out how.

Here is the Nextflow process in question:

process GupcallBases {
    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
    
    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
     
    output:
    path "*.bam" into bams_ch
            
    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
            guppy_basecaller --config $params.P1_gupConf \
            --device "cuda:0" \
            --bam_out \
            --recursive \
            --compress \
            --align_ref $params.refGen \
            -i $params.P1_inDir \
            -s $params.P1_outDir \
            --gpu_runners_per_device $params.P1_GPU_runners \
            --num_callers $params.P1_callers
    """
}

The output of the process is something like this:

$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)

I only want to keep the bam files in $params.P1_outDir/pass/, hence the pattern: "pass/*.bam" option, but I've tried a few other patterns to no avail.

The output syntax was chosen since once this process is done, using the following channel works:

//    Channel
//      .fromPath("${params.P1_outDir}/pass/*.bam")
//      .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
//      .set { bams_ch }

But the problem is that if I don't pass the files into the output channel of the first process, the two processes run in parallel. I may simply be missing something in the extensive documentation on how to order processes, which would be an alternative solution.

Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.

Thanks in advance.

score 3 · Accepted Answer · answered Jan 17 '22 at 15:39

Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line inputs and/or outputs are specified using params. Using params like this can be problematic: if, for example, a params variable makes the tool write to an absolute path while your output declaration expects files in the task's Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.

The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
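
On the ordering part of the question: once the BAMs are emitted through the output channel, any process that reads from `bams_ch` will only start after the basecalling task has finished, so no extra ordering logic is needed. A minimal sketch of such a consumer, in the same DSL1 syntax used in this thread, might look like the following. The process name `IndexBam`, the channel `bai_ch`, and the use of `samtools index` are all hypothetical and only there for illustration.

```groovy
// Hypothetical downstream step; names are illustrative, not from the question.
// A single task that matches several files with one glob emits them as one list,
// so flatten() is used here to run one task per BAM file.
process IndexBam {

    input:
    path bam from bams_ch.flatten()

    output:
    path "${bam}.bai" into bai_ch

    """
    samtools index "${bam}"
    """
}
```

If the downstream tool should instead receive all BAMs in one task, dropping `.flatten()` should stage the whole list into a single work directory.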

Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:

params.publishDir = './results'

input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )

process GuppyBasecaller {

    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )
    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
    container '/path/to/guppy_basecaller.img'

    input:
    path input_dir
    path guppy_config
    path ref_genome

    output:
    path "outdir/pass/*.bam" into bams_ch

    """
    mkdir outdir
    guppy_basecaller \\
        --config "${guppy_config}" \\
        --device "cuda:0" \\
        --bam_out \\
        --recursive \\
        --compress \\
        --align_ref "${ref_genome}" \\
        -i "${input_dir}" \\
        -s outdir \\
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \\
        --num_callers "${params.guppy_callers}"
    """
}
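
For reference, the settings mentioned above could live in a `nextflow.config` roughly like this. It is only a sketch under a few assumptions: the container path is the same placeholder used in the process, the `pbspro` executor is taken from the question, and using `singularity.runOptions` to pass `--nv` is an alternative to exporting `SINGULARITY_NV=1` that I have not tested on this cluster.

```groovy
// nextflow.config (sketch; paths are placeholders)
singularity {
    enabled = true
    // Assumption: extra options are passed through to the singularity command,
    // so '--nv' here should enable NVIDIA GPU support inside the container.
    runOptions = '--nv'
}

process {
    // Executor taken from the question
    executor = 'pbspro'

    // Per-process settings, selected by process name
    withName: GuppyBasecaller {
        container = '/path/to/guppy_basecaller.img'
        beforeScript = 'module load cuda/11.4.2'
    }
}
```

Keeping these in the config rather than in the process definition means the workflow script stays portable and the cluster-specific parts can be swapped per site.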

Thanks! That mostly seems to have worked. For some reason using `prescript` didn't work, and I kept getting errors related to cuda libraries, so I reverted to my original method. I have two questions about what you've suggested: 1. What is the benefit of `input_dir = file( params.input_dir )` then `-i "${input_dir}"` over `-i "${params.P1_inDir}"` for the input? 2. Could you please explain this: `fn -> fn.substring(fn.lastIndexOf('/')+1)`? Just so I haven't misinterpreted the reason. — dthorbur, Jan 19 '22 at 22:04
@Miles No worries at all! I'm really glad that is working. The benefit of using `input_dir = file(params.input_dir)` is that 'input_dir' is now implicitly a [value channel](https://www.nextflow.io/docs/latest/channel.html#value-channel). So specifying this channel in the 'input' declaration will ensure that your inputs are available in the working directory when the job is run. Note that `path input_dir` is just syntactic sugar for `path input_dir from input_dir`: see [input of generic values](https://www.nextflow.io/docs/latest/process.html#input-of-generic-values). — Steve, Jan 19 '22 at 23:44
The 'saveAs' option can be ignored, but basically it just strips out the 'outdir/pass' prefix from each of the outputs. `fn.lastIndexOf('/')`, like the name suggests, finds the index of the last '/' character in the string. If we add one, we get the index of the next character. We can then just substring from that index position to get the [name](https://www.nextflow.io/docs/latest/script.html#check-file-attributes) of the file. A bit like this closure: `{ fn -> file(fn).name }` but uses string functions only. — Steve, Jan 19 '22 at 23:52
I haven't run `guppy_basecaller` before, but if it automatically tries to create the outdir for you (if it doesn't exist for example), then you might even be able to change the output declaration to `path "pass/*.bam" into bams_ch`. At this point, you might decide to drop the `saveAs` option altogether. — Steve, Jan 19 '22 at 23:57
I tried using the output declaration of `"pass/*.bam" into bams_ch` before and now I'm a little unsure why it didn't work, but it was then paired with my old `publishDir` syntax so that may explain the problem. — dthorbur, Jan 20 '22 at 09:12
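
The `saveAs` closure discussed in the comments above can be checked on its own in plain Groovy, outside Nextflow. This is just a small sketch; the variable name `stripDirs` and the file names are made up for illustration.

```groovy
// Same logic as the saveAs closure: keep only the text after the last '/'.
def stripDirs = { String fn -> fn.substring(fn.lastIndexOf('/') + 1) }

// The 'outdir/pass/' prefix is removed, leaving the bare file name.
assert stripDirs('outdir/pass/reads_0.bam') == 'reads_0.bam'

// With no '/' present, lastIndexOf returns -1, so substring(0) keeps the whole name.
assert stripDirs('reads_0.bam') == 'reads_0.bam'
```

Because `publishDir` publishes each file under the name returned by `saveAs`, this is what makes the BAMs land directly in the publish directory instead of under `outdir/pass/`.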
