I have a string s where "substrings" are divided by a pipe. Substrings might or might not contain numbers. And I have a test character string n that contains a number and might or might not contain letters. See example below. Note that spacing can be any

I'm trying to drop all substrings where n is not in a range or is not an exact match. I understand that I need to split by -, convert to numbers, and compare low/high to n converted to numeric. Here's my starting point, but then I got stuck with getting the final good string out of unl_new.

s = "liquid & bar soap 1.0 - 2.0oz | bar 2- 5.0 oz | liquid soap 1-2oz | dish 1.5oz"
n = "1.5oz"

unl = unlist(strsplit(s,"\\|"))

unl_new = (strsplit(unl,"-"))
unl_new = unlist(gsub("[a-zA-Z]","",unl_new))

Desired output:

"liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"

Am I completely on the wrong path? Thanks!

Alexey Ferapontov

3 Answers

Don't know if it is general enough, but you might try:

require(stringr)
splitted<-strsplit(s,"\\|")[[1]]
ranges<-lapply(strsplit(
          str_extract(splitted,"[0-9\\.]+(\\s*-\\s*[0-9\\.]+|)"),"\\s*-\\s*"),
          as.numeric)
tomatch<-as.numeric(str_extract(n,"[0-9\\.]+"))
paste(splitted[
            vapply(ranges, function(x) (length(x)==1 && x==tomatch) || (length(x)==2 && findInterval(tomatch,x,rightmost.closed=TRUE)==1),TRUE)],
             collapse="|")
#[1] "liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"
nicola

Here is an option using base R:

## extract the n numeric
nn <- as.numeric(gsub("[^0-9|. ]", "", n))
## keep only numeric and -( for interval)
## and split by |
## for each interval test the condition to create a boolean vector
contains_n <- sapply(strsplit(gsub("[^0-9|. |-]", "", s),'[|]')[[1]],
       function(x){
         yy <- strsplit(x, "-")[[1]]
         yy <- as.numeric(yy[nzchar(yy)])
         ## the condition
         (length(yy)==1 && yy==nn) || length(yy)==2 && nn >= yy[1] && nn <= yy[2]
       })

## split again and use the logical vector to drop the parts
## that don't satisfy the condition
## paste the result, collapsing with "|" to get a single string again
paste(strsplit(s,'[|]')[[1]][contains_n],collapse='|')

## [1] "liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"
agstudy

Here's a method starting from your unl step using stringr:

library(stringr)
unl = unlist(strsplit(s,"\\|"))
# strip the letters from n and coerce what is left to numeric
n2 <- as.numeric(gsub("[[:alpha:]]*", "", n))
# \\d+ (rather than \\d) keeps multi-digit numbers such as 10.5 in one piece
num_lst <- str_extract_all(unl, "\\d+\\.?\\d*")
indx <- lapply(num_lst, function(x) {
  if(length(x) == 1) {isTRUE(all.equal(n2, as.numeric(x))) 
  } else {n2 >= as.numeric(x[1]) & n2 <= as.numeric(x[2])}})

paste(unl[unlist(indx)], collapse="|")
# [1] "liquid & bar soap 1.0 - 2.0oz | liquid soap 1-2oz | dish 1.5oz"

I also tested it with other amounts like "2.3oz". With n2 we coerce n to numeric for comparison. The variable num_lst isolates the numbers from the character string.

With indx we apply our comparisons over the extracted numbers. If there is one number, we check whether it equals n2. I chose not to use the basic == operator to avoid any rounding issues; isTRUE(all.equal(x, y)) is used instead. If there are two numbers, we check that n2 lies between them, bounds included.
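
As a small illustration of the rounding issue mentioned above (the values here are chosen only for demonstration and do not come from the question), exact comparison with == can fail for numbers that look equal, while all.equal() compares within a small tolerance:

```r
# 0.1 + 0.2 is not stored as exactly 0.3 in binary floating point
x <- 0.1 + 0.2

exact    <- x == 0.3                    # FALSE: exact comparison fails
tolerant <- isTRUE(all.equal(x, 0.3))   # TRUE: equal within tolerance

print(exact)
print(tolerant)
```

Wrapping the call in isTRUE() matters because all.equal() returns a character description of the difference, not FALSE, when the values differ.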

Finally, the logical index variable indx is used to subset the character string to extract the matches and paste them together with a pipe "|".
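
The steps above could also be wrapped into one function. The following is only a sketch in base R: the name filter_by_amount is made up for illustration, and it assumes that n contains exactly one number and that each piece of s contains either one number or one low/high pair.

```r
# Hypothetical helper (not from the answers above): keep the pipe-separated
# pieces of `s` whose single number, or low-high range, matches the number in `n`.
filter_by_amount <- function(s, n) {
  pieces <- strsplit(s, "|", fixed = TRUE)[[1]]
  # Assumes n holds exactly one number, e.g. "1.5oz" -> 1.5
  target <- as.numeric(gsub("[^0-9.]", "", n))
  keep <- vapply(pieces, function(p) {
    # Pull every number (e.g. "2", "1.0") out of the piece
    nums <- as.numeric(regmatches(p, gregexpr("[0-9]+\\.?[0-9]*", p))[[1]])
    if (length(nums) == 1) {
      isTRUE(all.equal(nums, target))                 # single value: tolerant equality
    } else if (length(nums) == 2) {
      target >= nums[1] && target <= nums[2]          # range: bounds included
    } else {
      FALSE                                           # no numbers, or more than two
    }
  }, logical(1), USE.NAMES = FALSE)
  # Trim each kept piece so the separator spacing is uniform
  paste(trimws(pieces[keep]), collapse = " | ")
}

s <- "liquid & bar soap 1.0 - 2.0oz | bar 2- 5.0 oz | liquid soap 1-2oz | dish 1.5oz"
result <- filter_by_amount(s, "1.5oz")
print(result)
```

Pieces with no number, or with more than two, are dropped here; that is a design choice of this sketch rather than something the question specifies.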

Pierre L