
I want to calculate dissimilarity indices on a binary matrix and have found several functions in R, but I can't get them to agree. As an example, I use the Jaccard coefficient with four functions: vegdist(), sim(), designdist(), and dist(). I'm going to use the result for a cluster analysis.

library(vegan)
library(simba)

#Create random binary matrix
function1 <- function(m, n) {
  matrix(sample(0:1, m * n, replace = TRUE), m, n)
}
test <- function1(30, 20)

#Calculate dissimilarity indices with jaccard coefficient
dist1 <- vegdist(test, method = "jaccard")
dist2 <- sim(test, method = "jaccard")
dist3 <- designdist(test, method = "a/(a+b+c)", abcd = TRUE)
dist4 <- dist(test, method = "binary")

Does anyone know why dist1 and dist4 are different from dist2 and dist3?

Andy Clifton
  • Have you studied the documentation? If that didn't provide the answer, have you studied the source code? – Roland Mar 08 '16 at 15:22
  • And you are forgetting the package stringdist :-), which has a metric for Jaccard distance, but based on q-gram profiles. – phiver Mar 08 '16 at 19:50
  • I have not been able to get an answer to my question from the documentation, and I'm not very good at reading source code, which is why I'm hoping for your help. – Magnus Hallas Mar 09 '16 at 12:04
  • If you want to use binary dissimilarities in **vegan** `vegdist`, you have to say so: use `vegdist(test, method="jaccard", binary=TRUE)`. Your equation for `dist3` defines *similarities* instead of **dis**similarities. For dissimilarity, use `designdist(test, "(b+c)/(a+b+c)", abcd=TRUE)` or `1 - designdist(test, "a/(a+b+c)", abcd=TRUE)`. Judging from the name, the `sim` function also defines similarities. In **R**, you normally need dissimilarities in cluster analysis, at least when using standard tools. – Jari Oksanen Mar 09 '16 at 14:00

1 Answer

I'm posting this as an answer as well. Here are the main comments on the dissimilarities you calculated:

  • dist1: you must set binary=TRUE in vegan::vegdist() (this is documented).

  • dist2: simba::sim() calculates Jaccard similarity, so you must use 1-dist2. The ?sim documentation gives a wrong formula for Jaccard similarity, although the code uses the correct one. Either way, the documented formula defines a similarity.

  • dist3: Your vegan::designdist() formula gives Jaccard similarity and you should change it to dissimilarity. There are many ways of doing this, and the code below gives one.

  • dist4: this is correctly done.
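
For reference, the relation between the two quantities can be written out as follows. This is a sketch that assumes the 2x2 notation designdist() uses with abcd = TRUE: a is the number of columns present in both rows, and b and c are the numbers present in only one of the two rows. The symbols J_sim and J_dis are labels introduced here for illustration.

$$
J_{\mathrm{sim}} = \frac{a}{a+b+c},
\qquad
J_{\mathrm{dis}} = \frac{b+c}{a+b+c} = 1 - J_{\mathrm{sim}}
$$

So a function that returns the similarity only needs to be subtracted from 1 to give the dissimilarity.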

Replacing your last four lines with these will do the trick and give numerically identical results from all four functions:

#Calculate dissimilarity indices with jaccard coefficient
dist1 <- vegdist(test, method = "jaccard", binary = TRUE)
dist2 <- 1 - sim(test, method = "jaccard")
dist3 <- designdist(test, method = "(b+c)/(a+b+c)", abcd = TRUE)
dist4 <- dist(test, method = "binary")
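
As a sanity check that needs no extra packages, the Jaccard dissimilarity can also be computed by hand and compared with dist(). This is only a minimal sketch: the seed and the helper names (a, n, union_size, jaccard_sim, jaccard_dis, d_base) are assumptions made for illustration, and it assumes that no row of the matrix is all zeros.

```r
# Reproducible random binary matrix (the seed is an arbitrary choice)
set.seed(42)
test <- matrix(sample(0:1, 30 * 20, replace = TRUE), 30, 20)

# a: number of shared presences for every pair of rows
a <- tcrossprod(test)

# n: number of presences in each row
n <- rowSums(test)

# a + b + c: columns present in at least one row of the pair
union_size <- outer(n, n, "+") - a

# Jaccard similarity, then dissimilarity as 1 - similarity
jaccard_sim <- a / union_size
jaccard_dis <- as.dist(1 - jaccard_sim)

# Base R reference: dist() with method = "binary"
d_base <- dist(test, method = "binary")

# Should be TRUE if the hand-computed values match dist()
all.equal(as.vector(jaccard_dis), as.vector(d_base))
```

If vegan and simba are installed, the same all.equal() comparison can be applied to dist1, dist2, and dist3 from the corrected lines above.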
Jari Oksanen
  • Thank you. I wasn't aware of the differences between the functions (similarity or dissimilarity). – Magnus Hallas Mar 10 '16 at 14:52
  • All standard **R** tools assume you have **dis**similarities. This is a design choice. Some other software makes different design choices, but while in **R**, do it the **R** way. – Jari Oksanen Mar 11 '16 at 07:48