I am cleaning some strings in R and need to split each one to recover two substrings that do not belong together. The problem is that there is no consistent delimiter to split on. What I do know is the set of substrings I am looking for, and I would like to use them as the split pattern without losing the pattern itself in the process.
Say a sample of the strings looks like this:
test <- c("Some string that explains x. Conflict",
"Some string that explains y. Additional information. Precaution",
"Some string that explains z. Justification. Conflict")
I wish to split those strings into the following list:
[1] "Some string that explains x."
[2] "Conflict"
[3] "Some string that explains y. Additional information."
[4] "Precaution"
[5] "Some string that explains z. Justification."
[6] "Conflict"
The crux of my problem is that I need to preserve the order.
Obviously, the pattern I mentioned is:
pattern <- c("Conflict", "Precaution")
Initially, most of my strings had a double space between the explanatory part and the so-called pattern, so I could simply use
unlist(strsplit(test, "\\s{2,}"))
to separate them. I now realize that some strings have only a single space there, so this method no longer works: those strings are left unsplit, and loosening the regex to split on a single space would break the explanatory part into its individual words.
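To make the whitespace issue concrete, here is a small sketch. The vector `mixed` is a made-up sample (not my real data) in which the first string has two spaces before the keyword and the second has only one; `two_plus` and `one_plus` are illustrative names.

```r
# Hypothetical sample: two spaces before "Conflict", one before "Precaution".
mixed <- c("Some string that explains x.  Conflict",
           "Some string that explains y. Precaution")

# Splitting on runs of two or more whitespace characters only separates
# the first string; the second one comes back as a single piece.
two_plus <- strsplit(mixed, "\\s{2,}")
print(lengths(two_plus))

# Splitting on any whitespace run instead breaks the sentence into words.
one_plus <- strsplit(mixed[2], "\\s+")[[1]]
print(one_plus)
```

So neither whitespace-only regex separates the keyword cleanly once single spaces are involved.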
I also looked into extracting the substrings, but when I tried it I lost the order I must preserve (I ended up with a new list containing only the extracted substrings).
With strsplit(), I cannot use the pattern directly, since splitting on it removes the pattern itself. I also tried a gsub() trick I found, which surrounds the pattern with "~" and then splits on that marker, but I could not get it to work.
For example, splitting on the pattern directly gives:
> unlist(strsplit(test, pattern))
[1] "Some string that explains x. "
[2] "Some string that explains y. Additional information. "
[3] "Some string that explains z. Justification. "
Essentially, how can I split the strings using this pattern and get the desired result? Alternatively, is there a way to extract the pattern from the original strings and insert the matches into the list in the proper order?
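For reference, a minimal sketch of the marker idea follows. It is one possible approach rather than a definitive one, and it assumes that each keyword appears only at the very end of a string, that it is preceded by at least one whitespace character, that "~" never occurs in the data, and that the keywords contain no regex metacharacters. The names `alt`, `marked` and `result` are illustrative only.

```r
test <- c("Some string that explains x. Conflict",
          "Some string that explains y. Additional information. Precaution",
          "Some string that explains z. Justification. Conflict")
pattern <- c("Conflict", "Precaution")

# Collapse the keywords into a single alternation, "Conflict|Precaution",
# instead of passing a vector (strsplit() recycles `split` along `x`).
alt <- paste(pattern, collapse = "|")

# Replace the whitespace in front of a trailing keyword with a "~" marker,
# keeping the keyword itself through the \\1 backreference.
marked <- sub(paste0("\\s+(", alt, ")$"), "~\\1", test)

# Split on the literal marker; unlist() keeps the pieces in input order.
result <- unlist(strsplit(marked, "~", fixed = TRUE))
print(result)
```

Because `\\s+` consumes any run of whitespace, the same call should cope with both the single-space and the double-space strings.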