2

I have a biological data set where I want to calculate the distance between centroids and each centroid represents a given year (so distance is calculated sequentially). I'm exploring usedist::dist_between_centroids() to calculate the distance in high dimensional space, but it seems quite arduous since the function requires vector inputs of the grouping variables (in this case, year). I've explored vegan::adonis() as an alternative function, but I can't figure out how to extract the distances. I've attached some sample data using Dune and recoded one of the factors as 'year.' My actual dataset consists of ~20 years worth of data, so manually calculating distances as I've done below is not practical. I think a loop with dist_between_centroids() might accomplish this task, but I'm not sure how to specify the grouping vectors in the loop.


# Species and environmental data
require(vegan)
require(usedist)

dune <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)

dune.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)

data(dune) 
data(dune.env)

all_data <- cbind(dune.env, dune) %>%
              arrange(Use)

all_data$Use <- recode_factor(all_data$Use, "Hayfield"="2017")
all_data$Use <- recode_factor(all_data$Use, "Haypastu"="2018")
all_data$Use <- recode_factor(all_data$Use, "Pasture"="2019")


bio_data <- all_data[,6:35] 

bio_distmat <- vegdist(bio_data, method = "bray", na.rm=T) 


#store distance in matrix
dist_between_mat <- as.data.frame(matrix(ncol=3, nrow=2))
colnames(dist_between_mat) <- c("start_centroid","end_centroid","distance")

dist_between_mat[1,1] = "2017"
dist_between_mat[1,2] = "2018"
dist_between_mat[1,3] = dist_between_centroids(bio_distmat, 1:7,8:15) #distance between 2017 and 2018

dist_between_mat[2,1] = "2018"
dist_between_mat[2,2] = "2019"
dist_between_mat[2,3] = dist_between_centroids(bio_distmat, 8:15,16:20) #distance between 2018 and 2019


Joshua Smith
  • 125
  • 6
  • Is `bio_data` a distance matrix? The documentation of `usedist::dist_between_centroids(d, ...)` explains that `d` is a distance matrix object. But it seems to me that `bio_data` might be the raw data/observations, not the distances? – dipetkov Jul 27 '22 at 07:35
  • Thank you for pointing that out. I added the distmat. – Joshua Smith Jul 27 '22 at 15:57

2 Answers2

1

vegan::adonis (or vegan::adonis2) does not return that information. vegan::betadisper does. Its result object contains distances which are the distances to the respective group centroid, and element group has the information of the corresponding group. If you want only one group, you must give a constant vector as the group.

Jari Oksanen
  • 3,287
  • 1
  • 11
  • 15
  • 1
    Maybe we should expose this functionality in *vegan* by taking the code from the internals of `betadisper` into a new function and export it? – Gavin Simpson Jul 27 '22 at 09:34
  • If I understand `betadisper` correctly it calculates the average distance between group members and their respective group centroid. This is more of a measure of dispersion. I am looking for a function that calculate the distance between the group centroids (e.g., distance from the 2017 centroid to the 2018 centroid, then the 2018 centroid to the 2019 centroid). – Joshua Smith Jul 27 '22 at 16:33
  • `betadisper` also returns the principal coordinates of centroids, and these can be used to calculate the distances among centroids. Is this the data you are looking for? However, you need to be careful with negative eigenvalues, and if you have them, you must use for result object `m` something like `sqrt(dist(m$centroids[,m$eig>0])^2 - dist(m$centroids[,m$eig<0]^2))`. If you only have positive eigenvalues, you can use use just `dist(m$centroids)`. – Jari Oksanen Jul 28 '22 at 11:06
  • The `usedist::dist_between_centroids` seems to give the same answers as the method I outlined above with distances that *can* be mapped into Euclidean space (_i.e._ have no negative eigenvalues in `betadisper`). However, the results are different with semimetric and non-metric indices which have negative eigenvalues, such as the one you have in your example (Bray-Curtis with `vegdist()`). NB., *if* you use `betadisper`, you should set `type = "centroid"` for your purpose. – Jari Oksanen Jul 28 '22 at 11:42
1

You can do this with a simple for-loop. But why write simple code when we can use "tidy" principles instead?

Here is a solution that iterates over the start years and the end years, generates a one-row tibble and then concatenates the rows into a final tibble.

Note that in your reproducible example the years/levels are in reverse chronological order. I use the levels ordering, without casting the levels to years, so make sure that this is the order you intend.

levels(all_data$Use)
#> [1] "2019" "2018" "2017"

n <- nlevels(all_data$Use)

start <- levels(all_data$Use)[1:(n - 1)]
start
#> [1] "2019" "2018"
end <- levels(all_data$Use)[2:n]
end
#> [1] "2018" "2017"

map2_dfr(start, end, ~ {
  idx1 <- which(all_data$Use == .x)
  idx2 <- which(all_data$Use == .y)
  tibble(
    start_centroid = .x,
    end_centroid = .y,
    distance = dist_between_centroids(bio_distmat, idx1, idx2)
  )
})
#> # A tibble: 2 × 3
#>   start_centroid end_centroid distance
#>   <chr>          <chr>           <dbl>
#> 1 2019           2018            0.210
#> 2 2018           2017            0.327

Created on 2022-07-27 by the reprex package (v2.0.1)

dipetkov
  • 3,380
  • 1
  • 11
  • 19