I am trying to do some grouping and am encountering this error.

Evaluation error: the 'height' component of 'tree' is not sorted (increasingly).

My input is:

library(stringdist)
name <- c("luke,abcdef","luke,abcdeh","luke,abcdeg")
a<-stringdistmatrix(name, method="jw")
clusts <- hclust(a, method="ward.D2")

But when I try to cut it, it gives me an error:

> cutree(clusts, h = 0.155)
Error in cutree(clusts, h = 0.155) : 
  the 'height' component of 'tree' is not sorted (increasingly)

But if I use

a<-stringdistmatrix(name, method="jw", p=0.05)

everything works fine.

I have looked for a solution and couldn't find one. What should I do to prevent this from happening, so that `cutree` keeps working?

I have also noticed, that if I have the same distance matrix, but generated by hand (so there is no distance parameter in the cluster.

Ravonrip
  • If you look at `diff(clusts$height)` for these two examples, the first comes out as a tiny negative number, the second as exactly zero. Basically the problem is that in this simple case all the distances are the same but there are small rounding differences due to imperfect binary representation of decimal numbers. I don't think you would get this problem with a more varied set of strings. – Andrew Gustar Oct 14 '17 at 22:42
  • So it is a specific issue that randomly occurred because I am unlucky, so to speak? The problem is that even though these names aren't my actual names, the distances between them are the same as between my actual names (I picked them for that reason). Is there any way to make it work automatically? I need my algorithm to work for all cases, even unlucky ones like this. So, can I do anything about it if I keep these same strings? – Ravonrip Oct 14 '17 at 22:51
  • You could try rounding the heights after calculating clusts - try `clusts$height <- round(clusts$height, 6)` – Andrew Gustar Oct 15 '17 at 06:02

1 Answer

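If you compare `diff(clusts$height)` for these two examples, the first comes out as a tiny negative number, the second as exactly zero. So the problem is caused by rounding differences in the binary representation of values that should be equal: the heights end up very slightly out of order, and `cutree` refuses to cut a tree whose heights are not sorted.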
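The effect can be reproduced without any clustering at all. The snippet below is only a minimal sketch: the values 0.1, 0.2 and 0.3 are illustrative and are not the heights from the question, and using `is.unsorted()` as the test is an assumption about which check is relevant here.

```r
# Two values that are equal on paper but differ in floating point
# (illustrative numbers, not the heights from the question)
heights <- c(0.1 + 0.2, 0.3)

heights[1] == heights[2]   # exact comparison fails
diff(heights)              # tiny negative number
is.unsorted(heights)       # TRUE: the sequence is not increasing

# all.equal() compares with a tolerance instead of exactly
isTRUE(all.equal(heights[1], heights[2]))
```

Running the same `diff()` and `is.unsorted()` calls on `clusts$height` should show whether a given tree is affected.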

It should work if you round the heights after calculating `clusts`:

clusts$height <- round(clusts$height, 6) 
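
To apply this automatically, the rounding can be wrapped in a small helper. This is a sketch under stated assumptions, not a definitive fix: `fix_heights` is a hypothetical name, the toy tree and the overwritten heights only simulate the rounding noise from the question, and the `cummax()` fallback is an addition of mine for the case where two nearly equal heights round to different values.

```r
# Toy tree: three points on a line (illustrative, not the question's strings)
clusts <- hclust(dist(c(0, 1, 3)), method = "ward.D2")

# Simulate the rounding noise by overwriting the heights with two values
# that should be equal but are slightly out of order in floating point
clusts$height <- c(0.1 + 0.2, 0.3)

# Hypothetical helper: only touch the heights when they are out of order
fix_heights <- function(tree, digits = 6) {
  if (is.unsorted(tree$height)) {
    tree$height <- round(tree$height, digits)
  }
  # Fallback (an assumption, not part of the answer): force a
  # non-decreasing sequence if rounding alone was not enough
  if (is.unsorted(tree$height)) {
    tree$height <- cummax(tree$height)
  }
  tree
}

fixed  <- fix_heights(clusts)
groups <- cutree(fixed, h = 0.5)   # cutting now succeeds
```

Note that the `cummax()` step would also hide genuine inversions in the tree, so it is only reasonable when the differences are known to be rounding noise.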
Andrew Gustar
  • I have a similar problem, but rounding didn't work. I'm working with the `mpg` dataset from ggplot2, and you are right, I do get negative numbers when running `diff` on the heights, but after rounding I still get the same message. Any ideas? – Jorge Lopez Feb 20 '19 at 08:44
  • It is hard to say without more details. How are you applying `mpg` to `hclust`? Note that the input to `hclust` needs to be a distance matrix - see `?hclust` for details – Andrew Gustar Feb 21 '19 at 10:14
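
For reference, a data frame would typically be converted with `dist()` before being passed to `hclust`. The sketch below is an assumption about what that could look like, since the comment above does not say how `mpg` was used: it substitutes the built-in `mtcars` data for `mpg`, and the choice of columns, scaling and linkage method is arbitrary.

```r
# mtcars is a built-in stand-in for ggplot2's mpg (assumption)
num <- mtcars[, c("mpg", "hp", "wt")]   # numeric columns only

# hclust() expects a dissimilarity object, so build one with dist()
d  <- dist(scale(num))
hc <- hclust(d, method = "ward.D2")

# Check whether the heights are in order before cutting by height
is.unsorted(hc$height)

# Cutting by number of clusters instead of by height
groups <- cutree(hc, k = 3)
```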