I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity). I'd like to cluster sequences of integers by sequence similarity rather than by something like the Euclidean distance, which isn't meaningful for this data.
My data looks something like this:
>>> dat.values
array([[860, 261, 240, ..., 300, 241, 1],
       [860, 840, 860, ..., 860, 240, 1],
       [260, 860, 260, ..., 260, 220, 1],
       ...,
       [260, 260, 260, ..., 260, 260, 1],
       [260, 860, 260, ..., 840, 860, 1],
       [280, 240, 241, ..., 240, 260, 1]])
I've created the following similarity function:
import numpy as np

def sim(x, y):
    # Fraction of positions at which the two sequences hold the same value.
    return np.sum(np.equal(np.array(x), np.array(y))) / len(x)
So I just return the fraction of matching values in the two sequences using NumPy, and then make the following call:
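To make the intended behaviour concrete, here is a small check of the similarity function on two short rows. The rows `a` and `b` are made up for illustration and are not taken from my real data.

```python
import numpy as np

def sim(x, y):
    # Fraction of positions at which the two sequences hold the same value.
    return np.sum(np.equal(np.array(x), np.array(y))) / len(x)

# Toy rows (hypothetical, for illustration only).
a = [860, 261, 240, 300]
b = [860, 840, 240, 300]

score = sim(a, b)  # 3 of the 4 positions match
print(score)       # 0.75
```

Identical rows give 1.0 and rows with no matching positions give 0.0, so this is a similarity (higher means closer) rather than a distance.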
cluster = AgglomerativeClustering(n_clusters=5, affinity=sim, linkage='average')
cluster.fit(dat.values)
But I get the following error:
TypeError: sim() missing 1 required positional argument: 'y'
I'm not sure why I'm getting this error; I expected the function to be called on pairs of rows, so that both required arguments would be passed.
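Conceptually, this is what I expected to happen: sim applied to every pair of rows, giving an n x n matrix of similarities. The following is only a sketch of that expectation on a toy array; `pairwise_sim` and `toy` are hypothetical names that are not part of my real code, and I'm not claiming this is how the library works internally.

```python
import numpy as np

def sim(x, y):
    # Fraction of positions at which the two sequences hold the same value.
    return np.sum(np.equal(np.array(x), np.array(y))) / len(x)

def pairwise_sim(rows):
    # Hypothetical helper: call sim on every pair of rows.
    rows = np.asarray(rows)
    n = rows.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = sim(rows[i], rows[j])
    return out

# Toy stand-in for dat.values (made-up values).
toy = np.array([[860, 261, 240, 1],
                [860, 840, 240, 1],
                [260, 260, 260, 1]])

S = pairwise_sim(toy)
print(S)  # symmetric 3 x 3 matrix with ones on the diagonal
```

Here the first two rows agree in three of four positions, so their entry is 0.75, while the third row only shares the trailing 1 with the others.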
Any help with this would be greatly appreciated.