Selecting number from string based on criteria

Question

I have the following data set:

PATH = c("5-8-10-8-17-20",
         "56-85-89-89-0-15-88-10",
         "58-85-89-65-49-51")
INDX = c(18, 89, 50)

data.frame(PATH, INDX)

PATH	INDX
5-8-10-8-17-20	18
56-85-89-89-0-15-88-10	89
58-85-89-65-49-51	50

The column PATH holds strings that each represent a series of numbers. For each row, I want to pick the largest number in PATH that is less than or equal to INDX: either a number equal to INDX or, failing that, the largest number below it.

My desired output would look like this:

PATH	INDX	PICK
5-8-10-8-17-20	18	17
56-85-89-89-0-15-88-10	89	89
58-85-89-65-49-51	50	49

Some of my thought-process behind the answer:

I know that with a function such as strsplit I could split each string on "-", sort the numbers, subtract INDX from each, and then select the difference that is zero or the negative one closest to zero. However, the original dataset is quite large, and I wonder whether there is a faster or more efficient way to perform this task.
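For reference, the idea described above can be sketched for a single row roughly as follows. This is only an illustration: the variable names (path, indx, d, keep, pick) are made up for the example, and sorting turns out not to be needed because which.max() finds the difference closest to zero directly.

```r
# One row of the example data
path <- "5-8-10-8-17-20"
indx <- 18

# Split on "-" and convert the pieces to numbers
x <- as.numeric(strsplit(path, "-", fixed = TRUE)[[1]])

# Differences to INDX; values <= 0 belong to numbers that do not exceed INDX
d <- x - indx
keep <- d <= 0

# Among the kept values, take the one whose difference is closest to zero
pick <- x[keep][which.max(d[keep])]
pick
# [1] 17
```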

What happens if there is no number smaller than INDX? – s_baldur Aug 05 '22 at 13:16
then it will be ok to return NA, thanks! – R_Student Aug 05 '22 at 13:27
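A minimal sketch of how the NA case from the comments could be handled in base R. The helper name pick_le is hypothetical, not taken from any answer; it returns NA when no number in the path is less than or equal to INDX, instead of letting max() return -Inf with a warning.

```r
PATH <- c("5-8-10-8-17-20",
          "56-85-89-89-0-15-88-10",
          "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

# Hypothetical helper: largest number in `path` that is <= `indx`, or NA if none
pick_le <- function(path, indx) {
  x <- as.numeric(strsplit(path, "-", fixed = TRUE)[[1]])
  ok <- x[x <= indx]
  if (length(ok) == 0) NA_real_ else max(ok)
}

# Apply the helper row by row; USE.NAMES = FALSE keeps the result unnamed
PICK <- mapply(pick_le, PATH, INDX, USE.NAMES = FALSE)
PICK
# [1] 17 89 49

# No number in the path is <= 1, so the helper returns NA
pick_le("5-8", 1)
# [1] NA
```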

zephryl · Answer 1 · 2022-08-05T13:35:52.457

4

Using purrr::map2_dbl():

library(purrr)

PICK <- map2_dbl(
  strsplit(PATH, "-"),
  INDX,
  ~ max(
    as.numeric(.x)[as.numeric(.x) <= .y]
  )
)

# 17 89 49

edited Aug 05 '22 at 13:35

answered Aug 05 '22 at 13:13

zephryl

be careful with `.x <= .y` when `.x` is still a string. – s_baldur Aug 05 '22 at 13:15
@sindri_baldur good catch, thanks — I edited accordingly. – zephryl Aug 05 '22 at 13:36
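To illustrate the point raised in this comment, here is a small sketch (the vector x is just example data). When a character vector is compared with a number, R coerces the number to a string and compares text, so the result can differ from the numeric comparison.

```r
x <- c("5", "8", "10", "17", "20")

# 18 is coerced to "18" and the comparison is done on text, character by character
x <= 18
# [1] FALSE FALSE  TRUE  TRUE FALSE

# Converting first gives the intended numeric comparison
as.numeric(x) <= 18
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```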

score 4 · Answer 2 · answered Aug 05 '22 at 13:13

4

Another option:

mapply(
  \(x, y) max(x[x <= y]),
  strsplit(PATH, "-") |> lapply(as.integer),
  INDX
)

# [1] 17 89 49

answered Aug 05 '22 at 13:13

s_baldur

gaut · Accepted Answer · 2022-08-05T13:10:01.850

1

The code below should be reasonably efficient; there is nothing wrong with your approach.

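# lapply() always returns a list, even when every path has the same length
numpath <- lapply(strsplit(PATH, "-"), as.numeric)
# positions of the values that do not exceed the matching INDX
maxindexes <- lapply(seq_along(numpath), function(x) which(numpath[[x]] <= INDX[x]))
# largest qualifying value per path
result <- sapply(seq_along(numpath), function(x) max(numpath[[x]][maxindexes[[x]]]))
result
# [1] 17 89 49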

edited Aug 05 '22 at 13:10

answered Aug 05 '22 at 13:00

gaut

hey gaut! thank you so much! would you please be so kind to let me know whats the difference btw sapply and lapply in the context of your code? – R_Student Aug 05 '22 at 13:06
sure, `lapply` always returns a list. `sapply` will try to simplify the result, here it returns just one vector with the numeric result figures. – gaut Aug 05 '22 at 13:08
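A short sketch of that difference, using made-up inputs (parts, as_list, as_vec and m are illustrative names). Note that sapply() simplifies to a matrix when every element has the same length greater than one.

```r
parts <- strsplit(c("5-8-10", "56-85"), "-")

# lapply() keeps the list structure
as_list <- lapply(parts, length)
is.list(as_list)
# [1] TRUE

# sapply() simplifies one value per element to a plain vector
as_vec <- sapply(parts, length)
as_vec
# [1] 3 2

# With equal-length results, sapply() returns a matrix (one column per input)
m <- sapply(strsplit(c("1-2", "3-4"), "-"), as.numeric)
is.matrix(m)
# [1] TRUE
```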

score 1 · Answer 4 · answered Aug 05 '22 at 13:16

Using dplyr

library(dplyr)

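df <- data.frame(PATH, INDX)

df |>
  rowwise() |>
  mutate(across(PATH, ~ {
    # split the current row's PATH and convert the pieces to numbers
    a <- as.numeric(unlist(strsplit(.x, split = "-")))
    # largest value that does not exceed this row's INDX
    max(a[a <= INDX])
  }, .names = "PICK"))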

  PATH                    INDX  PICK
  <chr>                  <dbl> <dbl>
1 5-8-10-8-17-20            18    17
2 56-85-89-89-0-15-88-10    89    89
3 58-85-89-65-49-51         50    49

score 1 · Answer 5 · answered Aug 05 '22 at 13:17

You can create a custom function like below:

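my_func <- function(vec1, vec2) {
  # split the path, convert to numbers, and sort ascending
  x <- sort(as.numeric(unlist(strsplit(vec1, split = "-"))))
  # after sorting, the count of values <= vec2 is the index of the largest one
  x[sum(x <= vec2)]
}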

df$PICK <- sapply(seq_len(nrow(df)), function(i) my_func(df$PATH[i], df$INDX[i]))

which will yield the following output:

#                     PATH INDX PICK
# 1         5-8-10-8-17-20   18   17
# 2 56-85-89-89-0-15-88-10   89   89
# 3      58-85-89-65-49-51   50   49
