1

I have two dataframes, one with raw data labels and one with the correct adjusted values the data needs to be matched to. The labels are numeric but can differ up to +/- 2. I am trying to figure out how to write a coded if/then loop since amatch does not work well for numerics. The goal is to have a loop where for every value in the raw data, it will check against the values in the corrected data and match to the closest match if one is present where corrected - raw is between -2 and 2. I have pasted my attempted but very much nonfunctional attempt below.

My thought is that it may be possible to use amatch to select the best/closest match when one is found, since in some cases there are several data label values close together while in others there is a shift in the number up to 2.

Is there a way to write such a code, or another better way to accomplish this? The goal is to have a corrected column matched to the raw data labels that I can then use to merge with the raw data and the additional metadata attached to the corrected labels, but for my full list of labels only about 60% match without needing this adjustment (you can see in the sample data, for example, 1910 should match to 1911 and 2056 needs to match to 2057). Because of the nature of the data, the differences are not consistent and I want this to be a function that I can apply to more than just a single instance of data so that I do not have to go through and match every data label by hand.

raw <- c(1419, 1444, 1460, 1485, 1501, 1542, 1581, 1590, 
         1606, 1622, 1647, 1663, 1688, 1704, 1743, 1791, 
         1793, 1809, 1850, 1866, 1891, 1905, 1910, 1954, 
         1956, 1976, 1996, 2012, 2028, 2041, 2053, 2056, 
         2067, 2100, 2102, 2122)

corrected <- c(1419, 1444, 1460, 1485, 1501, 1542, 1562, 
               1581, 1590, 1606, 1622, 1630, 1647, 1663, 
               1688, 1704, 1743, 1792, 1793, 1809, 1825, 
               1834, 1850, 1866, 1891, 1905, 1911, 1914, 
               1938, 1954, 1955, 1971, 1976, 1996, 2012, 
               2019, 2028, 2053, 2057, 2100, 2101, 2122)


labelmatch <- function(x, y) {data.frame(glycan=x, glycan_name=
                                            (for(i in 1:length(x)) {
                                              for(n in 1:length(y)) {
                                                if (n-i <= 2 & n-i >=-2) {
                                                  match(x, y)} else{
                                                    if (n-i >= 2 | n-i <=-2){
                                                  next}}}}))
}

labelmatch(raw, corrected)
  • Any reason your values are all strings? You can't e.g. subtract 1 from a string. – Axeman Jul 06 '23 at 17:40
  • what if there are multiple matches to raw? in raw 1956 matches to 1954 and 1955 – Mike Jul 06 '23 at 17:49
  • My hope was that when there are multiple matches I can make it where it selects the closest match but I am not sure how to do that. – Elizabeth Wallace Jul 06 '23 at 18:12
  • Axeman, good point, I changed the values back to numbers. That was my bad! The code still does not work though, so the general problem applies – Elizabeth Wallace Jul 06 '23 at 18:17
  • What happens when there is no close match within 2? ie`2041` does not have a close match within 2 or even `2067` – Onyambu Jul 06 '23 at 18:20
  • Onyambu, that would be when the code would skip to the next iteration of the loop, typically meaning that the data label does not match to a real label in the list. Although it might be nice if I could make it so that the second column prints NA where there is no match so I could manually go through the results and see which labels did not have a match. – Elizabeth Wallace Jul 06 '23 at 18:22
  • Check the solution below: – Onyambu Jul 06 '23 at 18:51

3 Answers3

1

Since your corrected data is sorted, we can use that fact to quickly search through the vector. Inspired by np.searchsorted

searchsorted <- function(findIn, vec, isSorted = TRUE){
  if(!isSorted) findIn <- sort(findIn)
  idx <- rank(c(vec, findIn, -Inf),, 'first')[seq_along(vec)] - rank(vec)
  right_vals <- findIn[idx]
  left_vals <- findIn[(idx2<-idx - 1) + !idx2]
  right_vals[na_idx] <- left_vals[na_idx<- is.na(right_vals)]
  right_vals[idx2] <- left_vals[idx2<- abs(right_vals - vec) > abs(left_vals - vec)]
  is.na(right_vals) <- abs(right_vals - vec) > 2
  right_vals
}
searchsorted(corrected, raw)
[1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1792 1793 1809 1850 1866 1891 1905 1911 1954 1954 1976 1996 2012 2028   NA
[31] 2053 2057   NA 2100 2100 2122

--

Edit:

R does provide the function findInterval which could be used to simplify the task:

searchsorted <- function(x, vec){
  idx <- findInterval(x, vec, all.inside = TRUE)
  vals <- vec[idx]
  idx2 <- abs(vals - x) > 2
  vals2 <- vec[idx[idx2] + 1]
  is.na(vals2) <- vals2 - x[idx2] > 2
  replace(vals, idx2, vals2)
}

searchsorted(raw, corrected)
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704
[15] 1743 1792 1793 1809 1850 1866 1891 1905 1911 1954 1955 1976 1996 2012
[29] 2028   NA 2053 2057   NA 2100 2101 2122
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • Thank you!!!! This works wonderfully! – Elizabeth Wallace Jul 06 '23 at 19:19
  • Update: while this does work, it seems to only filter in one direction (1 up or 1 down) depending on how I adjust the idx-1. – Elizabeth Wallace Jul 12 '23 at 15:45
  • @ElizabethWallace 1 up or 1 down is the only one that is closer to the number. eg given any number ir 4, then given a sorted array [1,3,6,8] we know that 4 is between 3 and 6. so we only interested in 3 and 6 and nothing else. Actually after writing this code, i noticed that R does have a function to do the task you have at hand. ie `findInterval` – Onyambu Jul 12 '23 at 15:51
  • where in the code you suggested would be best to sub that in? Thanks! – Elizabeth Wallace Jul 12 '23 at 16:15
0

One way to do this would be to expand all options of raw and corrected using expand.grid() and keeping the matches where the difference is between -2 and 2. I also create a vector where the raw values do not have a match so you can look at those values manually.

library(dplyr)
labelmatch <- function(x,y){
expand.grid(x,y) %>% 
        mutate(diff = Var2-Var1) %>% 
        filter(between(diff,-2,2))
}

labels2 <- labelmatch(raw,corrected)

With expand grid being to computationally expensive maybe this solution is a bit more efficient. first I make a matrix subtracting the raw and corrected values, then I find the indices where they are between -2, and 2, from there I make a data frame of the pairs where the values are between -2, and 2, I used dplyr to keep unique pairs where the match is closest to the raw value.

x1 <- sapply(corrected, function(x){
  x - raw
})
row.names(x1) <- raw
colnames(x1) <- corrected

s1 <- which(x1 <= 2 & x1 >= -2, arr.ind = TRUE)
x2 <- data.frame(rows =  rownames(x1)[s1[,1]], cols =  colnames(x1)[s1[,2]], 
           values = x1[s1])

rawunlabeld <- setdiff(raw, x2$rows)

x3 <- x2 %>% 
      dplyr::group_by(rows) %>% 
      dplyr::filter(abs(values) == min(abs(values))) %>% 
      dplyr::distinct()
Mike
  • 3,797
  • 1
  • 11
  • 30
  • 2
    This is not scalable eg. what if `x` is of length 10,000 and `y` is of length `100,000`??? – Onyambu Jul 06 '23 at 18:03
  • Yeah, when I try this I get: Error in `mutate()`: ℹ In argument: `diff = Var2 - Var1`. Caused by error in `Ops.data.frame()`: ! ‘-’ only defined for equally-sized data frames – Elizabeth Wallace Jul 06 '23 at 18:11
0

Probably you can try

> raw * NA^(!colSums(abs(outer(corrected, raw, `-`)) <= 2))
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1791 1793 1809 1850 1866 1891 1905 1910 1954 1956 1976 1996 2012 2028   NA
[31] 2053 2056   NA 2100 2102 2122
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81