Create and assign groups based on overlapping/nonoverlapping values between two columns

Question

I have a data frame which includes min and max columns:

df <- data.frame(min=c(2, 4, 3, 3, 2, 6),
                 max=c(2.9, 5.9, 3.9, 4.9, 7.9, 7.9))

I am interested in creating and assigning groups based on overlap/non-overlap between the two columns, where rows are given a new letter if they overlap with no previous rows, and then that letter is given to all rows which overlap with that row.

For example, for the first two rows, there is no overlap, so they get different letters ("a" and "b" respectively). But row 1 overlaps with row 5, so row 5 gets the letter a too. If a row is completely inside another it would receive the same letter code.
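
The pairwise rule described above can be sketched as a small test. This is only an illustration: the helper name `ranges_overlap` is hypothetical, and treating ranges that merely touch at an endpoint as non-overlapping (strict inequalities) is an assumption, not something the question states.

```r
# Hypothetical helper: TRUE when two [min, max] ranges overlap.
# Assumption: ranges that only touch at an endpoint do NOT count as overlapping.
ranges_overlap <- function(min1, max1, min2, max2) {
  min1 < max2 & max1 > min2
}

rows_1_2 <- ranges_overlap(2, 2.9, 4, 5.9)  # rows 1 and 2: no overlap
rows_1_5 <- ranges_overlap(2, 2.9, 2, 7.9)  # rows 1 and 5: overlap
```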

For the data frame above, the desired output would be the group column:

  min max group
1   2 2.9     a
2   4 5.9     b
3   3 3.9     c
4   3 4.9    bc
5   2 7.9  abcd
6   6 7.9     d

Essentially, this is similar to using letter codes to denote statistical significance (in the spirit of the "compact letter display", multcomp::cld), but based on min and max values. I have a large data frame and would like to automate this process, with the ultimate goal of placing the letter code above each geom_errorbar in ggplot2.

The suggestion by @Mark works, but it seems to give a different set of answers when the min/max columns have more than one decimal place. Any idea what's happening?

testdf2 <- structure(list(site = structure(1:9, levels = c("TR", "SF", 
"AR", "MR", "PD", "PL", "GC", "WM", "EM"), class = c("ordered", 
"factor")), mean = c(18.160173892, 17.4521769872, 12.078365989, 
20.4040729028, 25.272373546, 29.216489076, 7.1171799882, 7.3105614156, 
9.781365245), min = c(14.1, 12.6, 9.2, 15.1, 19.8, 22.3, 3.9, 
4.8, 6.2), max = c(22.3, 22.9, 15.2, 31.2, 31.3, 36.5, 11.1, 
9.8, 14.7), min2 = c(14.12, 12.62, 9.21, 15.06, 19.76, 22.26, 
3.87, 4.84, 6.24), max2 = c(22.35, 22.95, 15.22, 31.16, 31.32, 
36.49, 11.11, 9.84, 14.7), overlap = c("a", "ab", "ac", "ab", 
"ab", "b", "c", "c", "ac"), overlap2 = c("a", "a", "ab", "a", 
"a", "a", "b", "b", "ab")), row.names = c(1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L), class = "data.frame")

It is much easier to see with the plots.

library(ggplot2)

ggplot(data=testdf2, aes(x = site, y = mean)) +
  geom_errorbar(aes(ymin=min, ymax=max), size=1, width=0.3) +
  geom_col(alpha=0.5) +
  ylab("One-decimal min max") +
  geom_text(aes(label = overlap, y = max), vjust = -.5) +
  theme_bw()

ggplot(data=testdf2, aes(x = site, y = mean)) +
  geom_errorbar(aes(ymin=min2, ymax=max2), size=1, width=0.3) +
  geom_col(alpha=0.5) +
  ylab("Two-decimal min max") +
  geom_text(aes(label = overlap2, y = max2), vjust = -.5) +
  theme_bw()
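
One possible explanation, assuming the grouping uses a strict comparison of the form `df$min < df$max[row]`: at one decimal place, the max of site TR (22.3) equals the min of site PL (22.3), so the two ranges only touch and are not counted as overlapping; at two decimals (22.35 vs. 22.26) they genuinely overlap. A minimal check using only those two values from the data above (the variable names are illustrative):

```r
# Site TR (row 1) max and site PL (row 6) min, at both precisions
tr_max  <- 22.3;  pl_min  <- 22.3    # one-decimal columns (max, min)
tr_max2 <- 22.35; pl_min2 <- 22.26   # two-decimal columns (max2, min2)

# Strict comparison: touching endpoints are not an overlap
touch_1dp   <- pl_min  < tr_max    # the ranges only touch at 22.3
overlap_2dp <- pl_min2 < tr_max2   # the ranges overlap by 0.09
```

Under that assumption, PL starts its own letter in the one-decimal version but inherits TR's letter in the two-decimal version, which would account for the different labels. Rounding the limits changed which ranges overlap; the procedure itself behaves the same in both cases.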

What would the desired result be if you had 2 rows, with one being completely inside the other, e.g., `min = c(2,0), max = c(3, 10)`? — Gregor Thomas, Aug 14 '23 at 18:18
What would the expectation be if you added a row `min=1.5, max=2.5`? That is, all of your current rows are perfectly distinct or perfectly overlapping, but none of them are partials like 1.5/2.5. — r2evans, Aug 14 '23 at 18:19
@r2evans I made it clearer (as to what I think the rules are, based on my answer, and OP's positive response). Take a look if you like :-) — Mark, Aug 16 '23 at 06:54
@GregorThomas I might be wrong but I believe the answer would be c("a", "a") based on my positively received answer :-) — Mark, Aug 16 '23 at 06:55
Welcome to SO, Bradley S! If the answer addresses your question, please [accept it](https://stackoverflow.com/help/someone-answers); doing so not only provides a little perk to the answerer with some points, but also provides some closure for readers with similar questions. If a question receives multiple answers, though you can only accept one answer, you have the option to up-vote as many as you think are helpful. (You should be able to accept and up-vote even though the question was closed. A "thanks" comment is insufficient for this forum.) Thank you! — r2evans, Aug 16 '23 at 12:29
@GregorThomas Yes the idea is that a contained row would have the same letter code. — Bradley S, Aug 17 '23 at 17:31
@r2evans In your example a partial would have the letter codes associated with each other row it is partially contained. I added more clarity in the question details and explained that this is similar to compact letter display. — Bradley S, Aug 17 '23 at 17:34

score 0 · Accepted Answer · answered Aug 14 '23 at 18:58

library(dplyr)  # for if_else()

# make an empty column for the overlaps
df$overlap <- ""

# create letter index
letter <- 1

while (any(df$overlap == "")) {
    # get the first row that does not have a letter yet
    row <- which(df$overlap == "")[1]

    # if the ranges overlap, then add the letter to the overlap column, otherwise add nothing
    new_vals <- if_else(df$min < df$max[row] & df$max > df$min[row], letters[letter], "")

    # increment letter index
    letter <- letter + 1

    # add the new values to the overlap column
    df$overlap <- paste0(df$overlap, new_vals)
}
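
For reuse on a larger data frame, the same loop could be wrapped in a function. The following is a base-R sketch rather than the code above verbatim: the name `assign_overlap_groups` and its arguments `lo` and `hi` are hypothetical, and the two guards (every row must satisfy `lo < hi`, and at most 26 letters are available) are added assumptions so that the loop cannot run forever or run out of letters silently.

```r
# Hypothetical wrapper around the loop above (base R only).
assign_overlap_groups <- function(lo, hi) {
  # Assumption: every range has positive width, so each row overlaps itself
  # and the while loop is guaranteed to terminate.
  stopifnot(length(lo) == length(hi), all(lo < hi))

  overlap <- character(length(lo))
  letter <- 1

  while (any(overlap == "")) {
    if (letter > length(letters)) stop("more than 26 groups needed")

    # first row that has no letter yet
    row <- which(overlap == "")[1]

    # rows whose range strictly overlaps that row's range
    hits <- lo < hi[row] & hi > lo[row]
    overlap[hits] <- paste0(overlap[hits], letters[letter])

    letter <- letter + 1
  }
  overlap
}

df <- data.frame(min = c(2, 4, 3, 3, 2, 6),
                 max = c(2.9, 5.9, 3.9, 4.9, 7.9, 7.9))
df$group <- assign_overlap_groups(df$min, df$max)
df$group
# "a" "b" "c" "bc" "abcd" "d"
```

On the example data from the question this reproduces the desired group column.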
