Find n-1 closest values based on criteria in a dataframe in R

Question

I have a df with data from a qPCR run:

df_1 <- structure(list(
  row = c("A", "A", "A", "A", "B", "B"), 
  column = c(17L, 18L, 19L, 20L, 17L, 18L), 
  Treatment = c("Clp-1", "Clp-1","Clp-1", "Clp-1", "Clp-1", "Clp-1"), 
  Time = c("1h", "1h", "1h", "1h", "1h", "1h"), 
  Sample_Nr = c("1.1", "1.1", "1.1", "1.1", "1.2", "1.2"), 
  Target_Name = c("ClP-1", "ClP-1", "ClP-1", "ClP-1", "ClP-1", "ClP-1"), 
  Task = c("UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN","UNKNOWN"), 
  Reporter = c("SYBR", "SYBR", "SYBR", "SYBR", "SYBR", "SYBR"), 
  CT = c(30.7594337463379, 29.7701301574707,31.2958374023438, 
         29.883508682251, 28.765043258667, 28.3563442230225)), 
  row.names = c(NA, 6L), class = "data.frame")

This is an excerpt of the df.

I'm trying to find the n-1 closest Ct values based on the criteria "Sample_Nr" & "Target_Name" to calculate their average for downstream analysis.

I found this solution online so far:

n = 4
df_1 <- df %>% group_by(Sample_Nr,Target_Name, Treatment, Time) %>% 
count("CT") %>% do(data.frame(findClosest(.$CT,n)))

Based on: How to find the three closest (nearest) values within a vector?

My problem now is that "n" is a fixed value, but sometimes I have only three Ct values instead of four for a set of technical replicates (the missing one is an "NA" in the df). In that case the findClosest() function can't be applied to the df, because n is 4 by default (there are usually four technical replicates per condition).

How can I still use this function but adjusted to the number of Ct values I have for each condition?

So far I've tried the following:

a = df %>% group_by(Sample_Nr,Target_Name, Treatment, Time) %>% filter(!is.na(CT)) 
Vector_df1 <- c(table(a$Sample_Nr, a$Target_Name))

I tried to pass "Vector_df1" as my new "n" to findClosest() but this doesn't work.

Error message:

There were 50 or more warnings (Show first 50 warnings using warnings())

Warning:
1: Unknown or uninitialised column: CT.
2: In 0:(n - 1) : numeric expression has 81 elements: only first one is used.
...
49: Unknown or uninitialised column: CT.
50: In 0:(n - 1) : numeric expression has 81 elements: only first one is used.

PS:
I apologize if this post is too long or anything. I tried to be precise and include all relevant information. It's also my first post.

"Example of qPCR data" is not data. It is an image. Please use `dput(head(data,df))`. — langtang, Aug 12 '22 at 17:25

Rui Barradas · Accepted Answer · 2022-08-12T20:36:45.557


Here is a way. Change the function findClosest so that n is capped at the vector length whenever the vector has fewer than n elements.

suppressPackageStartupMessages({
  library(dplyr)
})

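findClosest <- function(vec, n) {
  require(zoo)
  # cap n at the number of values in the group
  if(n > length(vec)) n <- length(vec)
  vec1 <- sort(vec)
  # for every window of n sorted values: its total spread, then the values
  m1 <- rollapply(vec1, n, by = 1, function(i) c(sum(diff(i)), c(i)))
  # keep the window with the smallest spread and drop the spread column
  return(m1[which.min(m1[, 1]),][-1]) 
}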

n <- 4
df_1 %>%
  group_by(Sample_Nr, Target_Name) %>%
  summarise(Closest = findClosest(CT, n), .groups = "drop")
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> # A tibble: 6 × 3
#>   Sample_Nr Target_Name Closest
#>   <chr>     <chr>         <dbl>
#> 1 1.1       ClP-1          29.8
#> 2 1.1       ClP-1          29.9
#> 3 1.1       ClP-1          30.8
#> 4 1.1       ClP-1          31.3
#> 5 1.2       ClP-1          28.4
#> 6 1.2       ClP-1          28.8

Created on 2022-08-12 by the reprex package (v2.0.1)
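A comment below reports that this version fails with "wrong number of dimensions" for a group that contains an NA. A likely cause is that sort() drops missing values while n is capped at length(vec), which still counts them, so n can end up larger than the sorted vector. The following is a minimal sketch of an NA-tolerant variant in base R. The name find_closest_na is hypothetical and not part of the answer. It relies on the fact that, in a sorted window, the sum of successive differences equals the last value minus the first.

```r
# Hypothetical NA-tolerant variant of findClosest (name is an assumption).
find_closest_na <- function(vec, n) {
  vec1 <- sort(vec)                    # sort() drops NA by default
  n <- min(n, length(vec1))            # cap n at the number of non-missing values
  if (n < 1L) return(numeric(0))       # nothing left to choose from

  # first index of every window of n consecutive sorted values
  starts <- seq_len(length(vec1) - n + 1L)

  # spread of each window: last value minus first value
  spread <- vec1[starts + n - 1L] - vec1[starts]

  # return the values of the first window with the smallest spread
  best <- starts[which.min(spread)]
  vec1[best:(best + n - 1L)]
}

# One replicate is missing, so only three values can be returned
find_closest_na(c(30.76, NA, 31.30, 29.88), 4)

# Five values, keep the three that lie closest together
find_closest_na(c(1, 10, 11, 12, 50), 3)
```

Under these assumptions the function could be passed to summarise() in place of findClosest, without filtering out the NA rows first.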

Edit

To keep the n - 1 rows that minimize the variance of Closest, I have written an auxiliary function smallest_var. Given an input of length n, it computes the variance of every combination of n - 1 elements and returns the indices of the first combination with the minimum variance. Those indices are then matched against the row number within each group, and only the matching rows are kept.

smallest_var <- function(x) {
  n <- length(x)
  if(n > 2) {
    inx <- combn(seq_along(x), n - 1L)
    v <- apply(inx, 2, \(i) var( x[i] ))
    inx[, which.min(v) , drop = TRUE]
  } else seq_along(x)
}

n <- 4
df_1 %>%
  group_by(Sample_Nr, Target_Name) %>%
  summarise(Closest = findClosest(CT, n)) %>%
  filter(row_number() %in% smallest_var(Closest)) %>%
  ungroup()
#> `summarise()` has grouped output by 'Sample_Nr', 'Target_Name'. You can
#> override using the `.groups` argument.
#> # A tibble: 5 × 3
#>   Sample_Nr Target_Name Closest
#>   <chr>     <chr>         <dbl>
#> 1 1.1       ClP-1          29.8
#> 2 1.1       ClP-1          29.9
#> 3 1.1       ClP-1          30.8
#> 4 1.2       ClP-1          28.4
#> 5 1.2       ClP-1          28.8

Created on 2022-08-12 by the reprex package (v2.0.1)
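Since the stated goal in the question is the average of the retained Ct values, here is a minimal base R sketch that goes straight to the per-group mean. It is not the pipeline above: it skips findClosest and assumes each group holds at most the usual four replicates, so that dropping one value is all that is needed. The helper name mean_tightest and the data frame df_demo are hypothetical; the Ct values are copied from df_1 in the question.

```r
# Hypothetical helper: mean of the (k - 1) values with the smallest variance,
# where k is the number of non-missing values in x.
mean_tightest <- function(x) {
  x <- x[!is.na(x)]
  k <- length(x)
  if (k <= 2L) return(mean(x))         # too few values to drop one (NaN if k is 0)
  idx <- combn(k, k - 1L)              # each column leaves out exactly one value
  v <- apply(idx, 2, function(i) var(x[i]))
  mean(x[idx[, which.min(v)]])         # first combination with minimum variance
}

# Demo data: the Ct values from df_1 in the question
df_demo <- data.frame(
  Sample_Nr   = c("1.1", "1.1", "1.1", "1.1", "1.2", "1.2"),
  Target_Name = "ClP-1",
  CT = c(30.7594337463379, 29.7701301574707, 31.2958374023438,
         29.883508682251, 28.765043258667, 28.3563442230225)
)

# One mean per Sample_Nr / Target_Name combination
res <- aggregate(CT ~ Sample_Nr + Target_Name, data = df_demo, FUN = mean_tightest)
res
```

For sample 1.1 this drops 31.2958, the same value the filter above removes, and averages the remaining three. If a group can hold more than n values, selecting the n closest first, as the answer does, is still required.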

edited Aug 12 '22 at 20:36

answered Aug 12 '22 at 17:50

Rui Barradas


Using this code I get the following error: Error in `summarise()`: ! Problem while computing `Closest = findClosest(CT, n)`. i The error occurred in group 3: Sample_Nr = "1.1", Target_Name = "VlS17". Caused by error in `m1[, 1]`: ! wrong number of dimensions. In the df this group is the first one to contain a "NA" for a Ct value. – LePlant Aug 12 '22 at 18:13
I can bypass this problem by removing all NAs before passing the df to the function. – LePlant Aug 12 '22 at 18:41
Just one more thing: How can I drop/remove the most distant Ct value from the df while or after applying `findClosest()` when n = 4? – LePlant Aug 12 '22 at 18:44
@LePlant I believe so, yes. But the data you have posted only has 4 rows in group1 and 2 in group 2, like this it's hard to test. – Rui Barradas Aug 12 '22 at 18:59
I tried to update df_1 to contain a bit more data but Stack overflow is complaining about it. Even if I don't change anything I'm unable to save my post while in "editing mode". "Your post appears to contain code that is not properly formatted as code. Please indent all code by 4 spaces using the code toolbar button or the CTRL+K keyboard shortcut." – LePlant Aug 12 '22 at 19:07
@LePlant See if the edit solves the problem. I filter by minimum, if by *distant* you mean maximum, change to `max`. – Rui Barradas Aug 12 '22 at 19:32
@max If I understand your code correctly it removes the smallest value, correct? What I want to do is to drop the outlier so the one value among the four which is the farthest away from the other three. To be even more precise, I want to identify the three values which give me the smallest variance. By the nature of a qPCR it's unlikely that the Ct values are such that the closest three won't be the three that minimize the variance at the same time. – LePlant Aug 12 '22 at 19:46
@LePlant I think it now is what you are asking for, see the new edit. – Rui Barradas Aug 12 '22 at 20:37
Thank you for your patience with me and for the edits. I'll try your solution tomorrow and keep you updated. – LePlant Aug 12 '22 at 20:58
I tried your solution @Rui Barradas. It works! Just one question. Do I still need the `findClosest()` function with your auxiliary function `smallest_var()`? If I understand correctly, `findClosest()` arranges the values according to absolute distance of them and arranges them accordingly. Is this needed as a previous step in order for `smallest_var()` to work properly? – LePlant Aug 17 '22 at 16:08
@LePlant Yes, you do. `findClosest` keeps the `n` closest values, `smallest_var` finds which group of `n-1` of those values has the smallest variance. The `n` variables must be found first, then compute the variances. – Rui Barradas Aug 17 '22 at 16:30
