Use of lapply to identify which bin a particular value lies in

Question

The data set is this:

badData <- list(c(296,310), c(330,335), c(350,565))
df <- data.frame(wavelength = seq(300,360,5.008667),
                  reflectance = seq(-1,-61,-5.008667))
df    
   wavelength reflectance
   300.0000   -1.000000
   305.0087   -6.008667
   310.0173  -11.017334
   315.0260  -16.026001
   320.0347  -21.034668
   325.0433  -26.043335
   330.0520  -31.052002
   335.0607  -36.060669
   340.0693  -41.069336
   345.0780  -46.078003
   350.0867  -51.086670
   355.0953  -56.095337

The original question was how to identify whether wavelength fell in any of the ranges given in badData. The solution offered is this: https://stackoverflow.com/a/52070363/1012249

My question is: using a similar syntax, how does one identify which badData bin each value falls into? Let's say badData were structured like this, and the bins are non-overlapping.

badData <- data.frame(bin=c('a','b','c'),start= c(296,330,350),end=c(310.01,335,565))

score 2 · Answer 1 · answered Aug 29 '18 at 08:56

Here is an example using fuzzy join:

library(fuzzyjoin)
library(dplyr) # attached for the %>% pipe
df %>%
  fuzzy_left_join(badData, # join badData to df
                  by = c("wavelength" = "start", # variables to join by
                         "wavelength" = "end"),
                  # one function per pair of variables, so the logic is
                  # wavelength >= start and wavelength <= end
                  match_fun = list(`>=`, `<=`))
#output
   wavelength reflectance  bin start    end
1    300.0000   -1.000000    a   296 310.01
2    305.0087   -6.008667    a   296 310.01
3    310.0173  -11.017334 <NA>    NA     NA
4    315.0260  -16.026001 <NA>    NA     NA
5    320.0347  -21.034668 <NA>    NA     NA
6    325.0433  -26.043335 <NA>    NA     NA
7    330.0520  -31.052002    b   330 335.00
8    335.0607  -36.060669 <NA>    NA     NA
9    340.0693  -41.069336 <NA>    NA     NA
10   345.0780  -46.078003 <NA>    NA     NA
11   350.0867  -51.086670    c   350 565.00
12   355.0953  -56.095337    c   350 565.00

Thanks. But I was looking for a solution based on lapply, similar to the link that I had shared — ashleych, Aug 29 '18 at 09:03
@ashleych "*But I was looking for a solution based on lapply*" Why? This is a very elegant and succinct solution. You don't need `lapply`! — Maurits Evers, Aug 29 '18 at 09:05
Agreed, and I’ve upvoted it too. But my motivation for the question was to understand if the lapply construct referred to can be extended to solve this. — ashleych, Aug 29 '18 at 09:43
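For readers who want an lapply-based version, here is a minimal sketch of one way it could be done. It is not taken from the linked answer; the names `hits`, `hit_matrix` and `idx` are illustrative, and it assumes closed intervals on both sides (the same logic as the fuzzy join above) and non-overlapping bins.

```r
badData <- data.frame(bin = c("a", "b", "c"),
                      start = c(296, 330, 350),
                      end = c(310.01, 335, 565))
df <- data.frame(wavelength = seq(300, 360, 5.008667),
                 reflectance = seq(-1, -61, -5.008667))

# One logical vector per bin: TRUE where the wavelength lies in [start, end]
hits <- lapply(seq_len(nrow(badData)), function(i) {
  df$wavelength >= badData$start[i] & df$wavelength <= badData$end[i]
})

# Rows are wavelengths, columns are bins
hit_matrix <- do.call(cbind, hits)

# Assumption: bins do not overlap, so each row has at most one TRUE.
# Rows with no TRUE get NA.
idx <- apply(hit_matrix, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)

df$bin <- as.character(badData$bin)[idx]
df
```

If the bins could overlap, `which(r)[1]` would silently keep only the first match, so that case would need different handling.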

Roland · Answer 2 · 2018-08-29T10:08:18.313

You don't need a loop. You can simply use cut:

badData <- data.frame(bin=c('a','b','c'),start= c(296,330,350),end=c(310.01,335,565))
df <- data.frame(wavelength = seq(300,360,5.008667),
                 reflectance = seq(-1,-61,-5.008667))

df$bins <- cut(df$wavelength, t(badData[, c("start", "end")]), 
               labels = head(c(t(cbind(as.character(badData$bin), "good"))), -1))
#   wavelength reflectance bins
#1    300.0000   -1.000000    a
#2    305.0087   -6.008667    a
#3    310.0173  -11.017334 good
#4    315.0260  -16.026001 good
#5    320.0347  -21.034668 good
#6    325.0433  -26.043335 good
#7    330.0520  -31.052002    b
#8    335.0607  -36.060669 good
#9    340.0693  -41.069336 good
#10   345.0780  -46.078003 good
#11   350.0867  -51.086670    c
#12   355.0953  -56.095337    c

You haven't said which side of the intervals should be open or closed, but this can be adjusted.
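As a small illustration of that adjustment, the sketch below uses the `right` and `include.lowest` arguments of `cut` on a single interval. The vector `x` and the one-interval `breaks` are made up for the example; only the endpoints 296 and 310.01 come from the question.

```r
x <- c(296, 300, 310.01)
breaks <- c(296, 310.01)

# Default (right = TRUE): interval is (296, 310.01], so 296 itself is excluded
default_bins <- as.character(cut(x, breaks, labels = "a"))

# right = FALSE: interval is [296, 310.01), so 310.01 itself is excluded
left_closed <- as.character(cut(x, breaks, labels = "a", right = FALSE))

# include.lowest = TRUE closes the otherwise open end of the outermost interval
both_closed <- as.character(cut(x, breaks, labels = "a", include.lowest = TRUE))
```

Note that `include.lowest` only affects the outermost break, so with several adjacent intervals one side of each inner interval stays open.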

It throws an error 'Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, : factor level [4] is duplicated' — ashleych, Aug 29 '18 at 09:35
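My understanding is that `factor()` has accepted duplicated labels since R 3.4.0, so this error probably points to an older R installation; treat that as an assumption rather than a confirmed diagnosis. A possible workaround that avoids building a factor with repeated labels is to ask `cut` for integer codes and index into the label vector. This is a sketch, and `breaks`, `labels` and `codes` are names introduced here for illustration.

```r
badData <- data.frame(bin = c("a", "b", "c"),
                      start = c(296, 330, 350),
                      end = c(310.01, 335, 565))
df <- data.frame(wavelength = seq(300, 360, 5.008667),
                 reflectance = seq(-1, -61, -5.008667))

# Interleave start/end into one increasing vector of break points
breaks <- c(t(badData[, c("start", "end")]))

# Alternate bin names with "good" for the gaps between bins
labels <- head(c(t(cbind(as.character(badData$bin), "good"))), -1)

# labels = FALSE makes cut return integer interval codes instead of a factor
codes <- cut(df$wavelength, breaks, labels = FALSE)

# Plain character column; values outside all breaks become NA
df$bins <- labels[codes]
df
```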
