I have a rule in Snakemake that runs HDBSCAN clustering. It previously used regular DBSCAN and worked fine, but after I modified it the problem started (I also changed the Snakefile for other reasons, so it is hard to say which change is to blame). Now only one file is run through HDBSCAN and gets its results generated. There are no errors; the downstream rules just report that they are waiting for missing files (the ones the HDBSCAN rule never produced). Here are the relevant parts of the Snakefile:
configfile: "config.yml"

samples, = glob_wildcards('data_files/normalized/{sample}.hdf5')

rule all:
    input:
        expand('results/tsne/{sample}_tsne.csv', sample=samples),
        expand('results/umap/{sample}_umap.csv', sample=samples),
        expand('results/umap/img/{sample}_umap.png', sample=samples),
        expand('results/tsne/img/{sample}_tsne.png', sample=samples),
        expand('results/clusters/umap/{sample}_umap_clusters.csv', sample=samples),
        expand('results/clusters/tsne/{sample}_tsne_clusters.csv', sample=samples),
        expand('results/neo4j/{sample}/{file}', sample=samples,
               file=['cells.csv', 'genes.csv', 'cl_contains.csv', 'cl_isin.csv',
                     'cl_nodes.csv', 'expr_by.csv', 'expr_ess.csv']),
        'results/neo4j/db_command'

rule cluster:
    input:
        script = 'python/dbscan.py',
        umap = 'results/umap/{sample}_umap.csv'
    output:
        umap = 'results/umap/img/{sample}_umap.png',
        clusters_umap = 'results/clusters/umap/{sample}_umap_clusters.csv'
    shell:
        "python {input.script} -umap_data {input.umap} "
        "-min_cluster_size {config[dbscan][min_cluster_size]} "
        "-img_umap {output.umap} -clusters_umap {output.clusters_umap}"
Here is what dbscan.py looks like:
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('agg')
from hdbscan import HDBSCAN
import pathlib
import os
import nice_service as ns


def run_dbscan(args):
    print('running HDBSCAN')
    path_to_img = args['-img_umap']
    path_to_clusters = args['-clusters_umap']
    path_to_data = args['-umap_data']

    # If the folders in the output paths do not exist, create them.
    # Each argument is a single path string, so take its dirname directly
    # (iterating over the string itself would loop over its characters).
    for path_to_save in (path_to_img, path_to_clusters):
        out_dir = os.path.dirname(path_to_save)
        pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)

    data = np.loadtxt(open(path_to_data, "rb"), delimiter=",")
    db = HDBSCAN(min_cluster_size=int(args['-min_cluster_size'])).fit(data)

    # True where the point was assigned to a cluster, False where it was
    # left unassigned, aka 'noise'
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.labels_ != -1] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print('Estimated number of clusters: %d' % n_clusters_)

    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]
        class_member_mask = (labels == k)
        xy = data[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], '.', color=tuple(col), markersize=1)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.savefig(path_to_img, dpi=500)
    np.savetxt(path_to_clusters, labels.astype(int), fmt='%i', delimiter=",")
    print('Finished running HDBSCAN algorithm')


if __name__ == '__main__':
    from sys import argv
    myargs = ns.getopts(argv)
    print(myargs)
    run_dbscan(myargs)
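For reference, after wildcard substitution Snakemake calls the script once per sample, along these lines (the sample name sampleA and the value 15 are placeholders):

python python/dbscan.py -umap_data results/umap/sampleA_umap.csv \
    -min_cluster_size 15 \
    -img_umap results/umap/img/sampleA_umap.png \
    -clusters_umap results/clusters/umap/sampleA_umap_clusters.csv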
The input files for the cluster rule are all present and correct; somehow the rule is simply skipped for every file but one.
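A dry run that prints each job's reason (-n is --dry-run, -r is --reason) should show whether cluster jobs for the other samples are being scheduled at all:

snakemake -n -r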