How can I, in R, use a greedy site selection algorithm that maximises unrepresented species richness?

Question

I would like to use a 'greedy' algorithm that would select sites such as to maximize unrepresented species richness. The first site selected would be the one with the highest richness,while the second would be the one with highest unrepresented (i.e., not already selected) richness and so on. If two sites had the same unrepresented richness, then the site with the most rare species would be selected.

The richness-based greedy algorithm I aim to use could work as follows:

Select Richness (site with highest number of species not already represented)
Ties broken by Representation (site supporting species with the lowest total representation among sites)
Further ties broken by Random.

The input data would be a species abundance matrix (rows = sites, cols = species), and the output a ranked list of sites to select.

Although I have found a range of descriptions for similar greedy algorithms in publications, I have not yet been able to find available software or R packages/code that would do that.

The closest I found is the function 'optim_species' of the package "ausplotsR", which uses a range of diversity metrics to optimize. However, in this function's output I obtain site numbers and how increasing them influences said metric, without getting the identity of each site. Note that the function output includes one ranked list of sites, but does not say for which diversity metric it was calculated.

See the following:

# load package
library(ausplotsR)

# create test matrix
species_matrix <- matrix(
  c(1, 0, 1, 1, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 1, 0, 0, 0,
    0, 1, 0, 0, 1, 0,
    0, 0, 0, 0, 0, 1),
  nrow = 5,
  ncol = 6,
  byrow = TRUE,
  dimnames = list(c("Site 1", "Site 2", "Site 3", "Site 4", "Site 5"), c("Species 1", "Species 2", "Species 3", "Species 4", "Species 5", "Species 6"))
)

# optim_species' test
optim_species(species_matrix, n.plt = 5, frequent = FALSE)

In this simple example, optim_species provides a list of ranked sites that do maximise richness, but with my dataset (species abundance for 130 species in 18 sites) the ranked sites do not necessarily optimize richness.

Is there a way to obtain the identity of the listed Sites for each diversity metric in this package? Or is there other alternatives that would produce a similar result?

It seems like you are asking a few different questions: 1. how can I create/where can I find a function which identifies species richness, 2. why isn't optim_species working in the way I expect, and 3. how can I then create an algorithm to select data which operates according to the rules I set out? — Mark, Jul 28 '23 at 03:31
My apologies if I was not clear enough in my question. 1. identifying species richness, total or at the plot level, is not a problem. 2. Yes, if optim_species worked as expected then that would solve my issue, especially if it produced a ranked list of sites for each diversity metric. 3. Alternatively, creting such an algorithm would solve my issue. Would you recommend submitting these as separate questions? — Francois Brassard, Jul 30 '23 at 11:40
re:`My apologies if I was not clear enough in my question`: no no, I just wanted to verify, considering it seemed like there were a lot of moving parts in it — Mark, Jul 30 '23 at 12:15

score 2 · Answer 1 · answered Jul 31 '23 at 20:24

We can implement a greedy approach without any packages.

f <- function(m) {
  n <- nrow(m)
  df <- data.frame(Site = character(n), CumRich = integer(n))
  cs <- 0L
  
  for (i in 1:n) {
    rs <- rowSums(m)
    j <- which.max(rs)
    cs <- cs + rs[j]
    df[i, 1] <- rownames(m)[j]
    df[i, 2] <- cs
    
    if (rs[j] == ncol(m)) {
      df[(i + 1L):n, 1] <- rownames(m)[-j]
      df[(i + 1L):n, 2] <- cs
      break
    }
    
    m <- m[,m[j,] == 0, drop = FALSE][-j,,drop = FALSE]
  }
  
  df
}

Testing:

f(species_matrix)
#>     Site CumRich
#> 1 Site 1       3
#> 2 Site 4       5
#> 3 Site 5       6
#> 4 Site 2       6
#> 5 Site 3       6

The larger problem described can be solved almost instantly:

set.seed(283637553)
species <- 130
sites <- 18
species_matrix <- matrix(
  +(runif(species*sites) < 0.2), sites, species,
  dimnames = list(paste("Site", 1:sites), NULL)
)

system.time(print(f(species_matrix)))
#>       Site CumRich
#> 1  Site 18      35
#> 2  Site 17      61
#> 3   Site 5      82
#> 4   Site 7      95
#> 5  Site 16     106
#> 6   Site 3     113
#> 7   Site 8     118
#> 8  Site 10     122
#> 9   Site 4     124
#> 10 Site 13     126
#> 11  Site 2     127
#> 12  Site 6     128
#> 13 Site 12     129
#> 14  Site 1     129
#> 15  Site 9     129
#> 16 Site 11     129
#> 17 Site 14     129
#> 18 Site 15     129
#>    user  system elapsed 
#>    0.01    0.00    0.01

This works perfectly. Thank you! – Francois Brassard Aug 11 '23 at 02:26 — Francois Brassard, Aug 11 '23 at 02:26

ThomasIsCoding · Answer 2 · 2023-08-01T06:56:26.477

Basic Idea: Recursion for Greedy

Since you are looking for greedy algorithm that pursues the maximization of richness, I guess you can implement it in a recursive manner, which always searches the Site that can maximize richness resulted by the previous selections.

Code Example

For example, you can define a custom function f like below

f <- function(m, i = nrow(m)) {
    if (i == 1) {
        rs <- rowSums(m)
        return(sample(names(which(rs == max(rs))), 1))
    }
    ridx <- Recall(m, i - 1)
    cidx <- names(which(colSums(m[ridx, , drop = FALSE]) > 0))
    rx <- !rownames(m) %in% ridx
    cx <- !colnames(m) %in% cidx
    mm <- m[rx, cx, drop = FALSE]
    rs <- rowSums(mm)
    if (diff(range(rs)) == 0) {
        return(c(ridx, names(which.min(rowSums(m[rx, , drop = FALSE])))))
    } else {
        return(c(ridx, sample(names(which(rs == max(rs))), 1)))
    }
}

and you will obtain

> f(species_matrix)
[1] "Site 1" "Site 4" "Site 5" "Site 2" "Site 3"

Bonus

If you would like to see the evolution of richness after each Site selection, you can have the following command on top of the result f(species_matrix), e.g.,

> Reduce(`|`, asplit(species_matrix[f(species_matrix), ]>0, 1), accumulate = TRUE)
[[1]]
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6
     TRUE     FALSE      TRUE      TRUE     FALSE     FALSE

[[2]]
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6
     TRUE      TRUE      TRUE      TRUE      TRUE     FALSE

[[3]]
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6
     TRUE      TRUE      TRUE      TRUE      TRUE      TRUE

[[4]]
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6
     TRUE      TRUE      TRUE      TRUE      TRUE      TRUE

[[5]]
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6
     TRUE      TRUE      TRUE      TRUE      TRUE      TRUE

How can I, in R, use a greedy site selection algorithm that maximises unrepresented species richness?

2 Answers2

Basic Idea: Recursion for Greedy

Code Example

Bonus