
A function in a package returns a single character string in which the original strings are concatenated. I need to separate them again, in other words recover the original elements. Here is an example and what I have tried:

orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"

What I need as an outcome is:

c("answer2","answer3")

I have tried to split() result, but there is no delimiter to split on, especially since I don't know in advance what the answers will be.

I have tried to match() the result to the orig, but I would need to do that with all substrings.

There has to be an easy solution, but I haven't found it.

Sotos
  • It's not necessarily a reversible process. For example if your origin set is c("ab", "cde", "abc", "de") then you simply can't know if the string "abcde" was the result of ("ab" and "cde") or ("abc" and "de"). Would you be happy with a solution that lists all four of those as options? If so, I should be able to propose something. (In real world cases this may or may not matter - for lists of single words it certainly *would* matter - this is one reason translation of some ancient languages is difficult because they didn't use spaces!) – Mike S Sep 06 '18 at 13:57
  • Will this do? `unlist(strsplit(result, "(?<=[\\d+])", perl = TRUE))` - Taken from [this answer](https://stackoverflow.com/a/21493089/5635580) – Sotos Sep 06 '18 at 14:00
  • *"I have tried to match() the result to the orig"*: you mean `orig` is available? – Stéphane Laurent Sep 06 '18 at 14:03

2 Answers


What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:

FindSubstrings <- function(orig, result){
  orig[sapply(orig, grepl, result)]
}

In more detail: grepl takes a pattern argument, checks whether it occurs in your string (result in our case), and returns TRUE or FALSE. We then subset the original values with the resulting logical vector: does each value occur in the string?
Possible improvements:

  • fixed=TRUE may be a bright idea, because you don't need the full regex power for simple strings matching
  • some match patterns may contain others, for example, "answer10" contains "answer1", so both would be reported whenever "answer10" occurs
  • stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
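
As a rough sketch of the first two points, using base R only: the lookup can be done with literal matching, and the same call shows the overlap caveat. The name `find_substrings_fixed` is made up for this illustration and does not come from any package.

```r
# Hypothetical helper: keep the elements of `orig` that occur literally in `result`.
find_substrings_fixed <- function(orig, result) {
  # fixed = TRUE treats each candidate as a plain string, not as a regex
  hits <- vapply(orig, grepl, logical(1), x = result, fixed = TRUE,
                 USE.NAMES = FALSE)
  orig[hits]
}

orig <- c("answer1", "answer2", "answer3")
found <- find_substrings_fixed(orig, "answer3answer2")
found
# [1] "answer2" "answer3"

# Caveat: a candidate that is contained in another one is reported as well
overlap <- find_substrings_fixed(c("answer1", "answer10"), "answer10")
overlap
# [1] "answer1"  "answer10"
```

Note that the result follows the order of `orig`, not the order in which the pieces appear in `result`.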
Liudvikas Akelis
  • Your answer is elegant, and I think in most of my cases the answers will not contain one another, but I will play it safe and add separator characters to the input. – kristofkelemen Sep 07 '18 at 04:03

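# Find every occurrence of any original string inside the merged result
index <- gregexpr(paste(orig, collapse = "|"), result)[[1]]
starts <- as.numeric(index)
# End position of each match = start + match length - 1
stops <- starts + attr(index, "match.length") - 1
substring(result, starts, stops)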

This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
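
The separator idea could look roughly like the sketch below. It assumes that "|" never occurs in the answers and that the package function concatenates its inputs unchanged; the `paste0()` call that builds `result` is only a stand-in for that function, not its real behaviour.

```r
orig <- c("answer1", "answer2", "answer3")
sep <- "|"  # assumption: a character that never occurs in the answers

# Append the separator to every input before it goes into the package function
tagged <- paste0(orig, sep)

# Stand-in for the package function's merged output (hypothetical)
result <- paste0(tagged[3], tagged[2])

# fixed = TRUE so that "|" is taken literally rather than as regex alternation
recovered <- strsplit(result, sep, fixed = TRUE)[[1]]
recovered
# [1] "answer3" "answer2"
```

With a separator in place the split no longer depends on knowing `orig`, and answers that contain one another stop being a problem.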

Aaron Hayman