I have a rule in Snakemake that runs HDBSCAN clustering. It previously used regular DBSCAN and worked fine, but after I modified it the problem started (I also changed the Snakefile for other reasons, so it is hard to say which change is to blame). Now only one file is run through HDBSCAN and gets its results generated. There are no errors; the downstream rules just report that they are waiting for missing files (the ones the HDBSCAN rule never produced). Here are the relevant parts of the Snakefile:
configfile: "config.yml"

samples, = glob_wildcards('data_files/normalized/{sample}.hdf5')

rule all:
    input:
        expand('results/tsne/{sample}_tsne.csv', sample=samples),
        expand('results/umap/{sample}_umap.csv', sample=samples),
        expand('results/umap/img/{sample}_umap.png', sample=samples),
        expand('results/tsne/img/{sample}_tsne.png', sample=samples),
        expand('results/clusters/umap/{sample}_umap_clusters.csv', sample=samples),
        expand('results/clusters/tsne/{sample}_tsne_clusters.csv', sample=samples),
        expand('results/neo4j/{sample}/{file}', sample=samples,
               file=['cells.csv', 'genes.csv', 'cl_contains.csv', 'cl_isin.csv',
                     'cl_nodes.csv', 'expr_by.csv', 'expr_ess.csv']),
        'results/neo4j/db_command'

rule cluster:
    input:
        script = 'python/dbscan.py',
        umap = 'results/umap/{sample}_umap.csv'
    output:
        umap = 'results/umap/img/{sample}_umap.png',
        clusters_umap = 'results/clusters/umap/{sample}_umap_clusters.csv'
    shell:
        "python {input.script} -umap_data {input.umap} "
        "-min_cluster_size {config[dbscan][min_cluster_size]} "
        "-img_umap {output.umap} -clusters_umap {output.clusters_umap}"
Here is what dbscan.py looks like:
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('agg')
from hdbscan import HDBSCAN
import pathlib
import os
import nice_service as ns


def run_dbscan(args):
    print('running HDBSCAN')
    path_to_img = args['-img_umap']
    path_to_clusters = args['-clusters_umap']
    path_to_data = args['-umap_data']

    # If the folders in the output paths do not exist, create them.
    # Each argument is a single path string, so take its dirname directly
    # (iterating over the string itself would loop over its characters).
    for path_to_save in (path_to_img, path_to_clusters):
        out_dir = os.path.dirname(path_to_save)
        pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)

    data = np.loadtxt(open(path_to_data, "rb"), delimiter=",")
    db = HDBSCAN(min_cluster_size=int(args['-min_cluster_size'])).fit(data)

    # True where the point was assigned to a cluster, False where it was
    # left unassigned, aka 'noise'
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.labels_ != -1] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print('Estimated number of clusters: %d' % n_clusters_)

    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]
        class_member_mask = (labels == k)
        xy = data[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], '.', color=tuple(col), markersize=1)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.savefig(path_to_img, dpi=500)
    np.savetxt(path_to_clusters, labels.astype(int), fmt='%i', delimiter=",")
    print('Finished running HDBSCAN algorithm')


if __name__ == '__main__':
    from sys import argv
    myargs = ns.getopts(argv)
    print(myargs)
    run_dbscan(myargs)
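For reference, after wildcard substitution Snakemake calls the script once per sample, along these lines (the sample name sampleA and the value 15 are placeholders):

python python/dbscan.py -umap_data results/umap/sampleA_umap.csv \
    -min_cluster_size 15 \
    -img_umap results/umap/img/sampleA_umap.png \
    -clusters_umap results/clusters/umap/sampleA_umap_clusters.csv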
The input files for the cluster rule are all present and correct; somehow the rule is simply skipped for every file but one.
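A dry run that prints each job's reason (-n is --dry-run, -r is --reason) should show whether cluster jobs for the other samples are being scheduled at all:

snakemake -n -r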