
I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition. Each record has date, time, etc., and presence/absence data (0/1) for each species.

I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust clusters the columns of its input, so the records have to be in columns), then ran pvclust on the transposed data with Jaccard distance (“binary” in R), which is suitable for species presence/absence data, and Ward’s method (“ward.D2”). I set “parallel = TRUE” to reduce computation time. However, with the default nboot = 1000, my computer could not finish the computation within hours and I eventually got an error, so I tried a lower nboot (100).
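As a small illustration of the distance measure mentioned above, the sketch below checks on a made-up matrix that base R's `dist(..., method = "binary")` gives the Jaccard distance. The matrix `pa` and the labels `rec1`, `sp1`, etc. are hypothetical and are not taken from the real dataset.

```r
# Hypothetical toy presence/absence matrix: rows = records, columns = species
pa <- matrix(c(1, 1, 0, 0,
               1, 0, 1, 0,
               0, 0, 1, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("rec", 1:3), paste0("sp", 1:4)))

# dist() computes distances between the ROWS of its input
d_binary <- dist(pa, method = "binary")

# Jaccard distance computed by hand for rec1 vs rec2:
# 1 - (species present in both) / (species present in at least one)
a <- pa["rec1", ] == 1
b <- pa["rec2", ] == 1
jaccard_12 <- 1 - sum(a & b) / sum(a | b)

# The two values should agree
all.equal(as.matrix(d_binary)["rec1", "rec2"], jaccard_12)
```

Note that `dist()` works on rows, whereas pvclust clusters columns, which is why the data are transposed before the pvclust call.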

I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:

library(pvclust)

# pvclust clusters columns, so transpose to put the records in columns
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary", 
                   nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)

This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).

[Figure: example of dendrogram obtained with the pvclust function]

As you can see, the p-values for the higher-level branches of the dendrogram all seem to be 0.

Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters. So my questions would be

  • is there anything I got wrong in the pvclust function itself?
  • could my low nboot (due to a “weak” computer) be a reason for the non-significance of my results?
  • are there other functions in R I could try for hierarchical clustering that also deliver p-values?

Thanks in advance!

I have tried to run the same code on a subset of 500 records with nboot = 1000. This ran in a reasonable computation time, but the output is still not very satisfying (see dendrogram 2).

[Figure: dendrogram obtained for a subset of 500 records and nboot=1000]
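For readers who want to repeat the subset run on their own data, here is a minimal sketch of one way a 500-record subset could be drawn and transposed. It uses simulated presence/absence data; `pa_full`, `pa_sub`, `tsub`, the seed and the 0.2 presence probability are all assumptions for illustration, and the post does not say how the actual subset was selected.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical stand-in for the real data: 5000 records x 13 species (0/1)
pa_full <- matrix(rbinom(5000 * 13, size = 1, prob = 0.2),
                  nrow = 5000,
                  dimnames = list(NULL, paste0("sp", 1:13)))

# Draw 500 records at random, without replacement
idx <- sample(nrow(pa_full), 500)
pa_sub <- pa_full[idx, , drop = FALSE]

# Transpose so that records become columns, as in the pvclust call above
tsub <- t(pa_sub)
dim(tsub)
```

The resulting `tsub` could then be passed to pvclust in place of `tdata`, with the same `method.hclust` and `method.dist` arguments.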

  • Without seeing a sample of your data it is not possible to do more than guess. You mention that there is "date, time, etc" information, but this would usually not be included in the analysis. Secondly, the manual page for the function `pvclust` in package `pvclust` gives only three options for `method.dist` and none of them is "binary" so it is not clear this method is appropriate for presence/absence data (check with the package maintainer). Third, test your code first on a subset of the data, e.g. 100 or 200 of your "thousands of records". – dcarlson Sep 17 '20 at 17:24
  • Hi @dcarlson, thank you for your interest! So date, time etc are NOT in the presence/absence matrix, so won't influence clustering. Secondly, if you look at rdocumentation, it says options for method.dist are "one of "correlation", "uncentered", "abscor" OR THOSE WHICH ARE ALLOWED FOR METHOD ARGUMENT IN DIST FUNCTION", which includes "binary". I've tried with 500 records - significance is very low for nboots=100. Here I get an output for nboot=1000, I'll add the graph to my post. – Julia Gostischa Sep 23 '20 at 09:04
