4

I am cleaning some strings in R and need to split each one to recover two substrings that do not belong together. The problem is that there is no consistent delimiter to split all the strings on. Rather, I know which substrings I am looking for, and I wish to use them as the split pattern without losing the pattern itself in the process.

Let's say a sample of the strings looks like this:

test <- c("Some string that explains x. Conflict", 
          "Some string that explains y. Additional information. Precaution",
          "Some string that explains z. Justification.   Conflict") 

I wish to split those strings into the following list:

[1] "Some string that explains x."
[2] "Conflict"
[3] "Some string that explains y. Additional information."
[4] "Precaution"
[5] "Some string that explains z. Justification."
[6] "Conflict"

The crux of my problem is that I need to preserve the order.

Obviously, the pattern I mentioned is:

pattern <- c("Conflict", "Precaution")

Most of the strings I had initially contained a double space between the explanatory part and the so-called pattern, so I could simply use

unlist(strsplit(test, "\\s{2,}"))

to differentiate them. I now realize that some of them have only one space between the two parts, so this method no longer works: splitting on a single space would break the explanatory string into its individual words.

Extracting the substrings was an option I looked into, but when I tried it I lost the order I must preserve (I ended up with a new list containing only the extracted substrings).

With strsplit(), I cannot use the said pattern directly, since splitting the string on it removes the pattern itself. I also tried a gsub() trick I found, surrounding the pattern with "~" and then splitting on that, but I could not get it to work.

Namely,

 > unlist(strsplit(test, pattern))
[1] "Some string that explains x. "                        
[2] "Some string that explains y. Additional information. "
[3] "Some string that explains z. Justification.   "

Essentially, how could I split the strings using the said pattern and get the desired result? Alternatively, is there a way to extract the pattern from the original strings and insert the matches into the list in the proper order?

pissall
Bora Dora
  • Are 'Conflict' and 'Precaution' the only words you want to look for like this? And does anything else ever appear at the end of a string that you specifically would not want to look for? – Hayden Y. Sep 20 '19 at 23:29
  • @HaydenY. I realize through your question that I should have been even more precise. I have more words to look for (approximately 10). In reality, the data I'm sorting has over 20 000 strings, and I honestly don't know whether something could appear at the end of a string which I would not want. This is actually what motivates the question, because I know exactly the pattern and I know I wish to retrieve it. – Bora Dora Sep 21 '19 at 11:25

4 Answers

2

If you combine the two patterns into a single string patt by joining them with '|', the new pattern will match either of the two original patterns in the test vector. str_remove then gives you the part without the pattern, and str_extract gives the part matching one of the patterns. You can interlace these two vectors into a single one using c(rbind(x, y))*. I assume this is less computationally efficient than using a single regex to get both the non-pattern and pattern parts.

Note: All this assumes the pattern you want to extract is just "Conflict" or "Precaution" and that they could show up anywhere in the strings. This is different from the logic in some other answers which are not identifying those two words but instead identifying the last part of the string. Not completely clear to me which you wanted so just FYI on the difference.

library(stringr)
patt <- paste(pattern, collapse = '|')
c(rbind(str_remove(test, patt), str_extract(test, patt)))

# [1] "Some string that explains x. "                        
# [2] "Conflict"                                             
# [3] "Some string that explains y. Additional information. "
# [4] "Precaution"                                           
# [5] "Some string that explains z. Justification.   "       
# [6] "Conflict" 

* See example below. This works because c will convert the matrix to a vector column-wise and you are creating the matrix with one element from each vector per column by rbind-ing the vectors together.

c(rbind(c('a', 'b', 'c'), c('A', 'B', 'C')))
#[1] "a" "A" "b" "B" "c" "C"
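
The trailing spaces in the output above can be trimmed, and the same interlacing idea also works without stringr. A minimal base R sketch, assuming every string contains one of the keywords (the names rest, key and result are only illustrative):

```r
test <- c("Some string that explains x. Conflict",
          "Some string that explains y. Additional information. Precaution",
          "Some string that explains z. Justification.   Conflict")
pattern <- c("Conflict", "Precaution")
patt <- paste(pattern, collapse = "|")

# Text with the first keyword match removed, then trimmed of stray spaces
rest <- trimws(sub(patt, "", test))

# First keyword match in each string. Assumption: every string has a match;
# regmatches() drops non-matching strings, which would misalign the vectors.
key <- regmatches(test, regexpr(patt, test))

# Interlace: one element of rest, then the matching element of key
result <- c(rbind(rest, key))
result
```

If some strings might contain no keyword, it would be safer to check length(key) == length(test) before interlacing.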
IceCreamToucan
1

An option is to split at the last space. Here we use regex lookarounds to match one or more spaces ( +) that follow a . ((?<=\\.)) and precede one or more non-whitespace characters (\\S+) running to the end ($) of the string

library(tidyr)
library(tibble)
tibble(test) %>%
     separate_rows(test,  sep="(?<=\\.) +(?=\\S+$)")
# A tibble: 6 x 1
#  test                                                
#  <chr>                                               
#1 Some string that explains x.                        
#2 Conflict                                            
#3 Some string that explains y. Additional information.
#4 Precaution                                          
#5 Some string that explains z. Justification.         
#6 Conflict                                            

Or using the same regex in base R

unlist(strsplit(test, "(?<=\\.) +(?=\\S+$)", perl = TRUE))
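
To see what the two lookarounds require, here is a small sketch on two made-up strings (x and parts are hypothetical names, not from the question). The split happens only when the final word is preceded by a period and at least one space:

```r
x <- c("Ends with a period.   Conflict",
       "No period before Conflict")

# Same regex as above; strsplit() returns one character vector per input string
parts <- strsplit(x, "(?<=\\.) +(?=\\S+$)", perl = TRUE)

parts[[1]]  # "Ends with a period." "Conflict"
parts[[2]]  # "No period before Conflict" (left unsplit: no period before the last word)
```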

If there is a specific vector of words before which we need to split, create the regex from that vector

pat <- paste0("\\s+(?=\\b(", paste(pattern, collapse="|"), ")\\b)")

and use that in strsplit

unlist(strsplit(test, pat, perl = TRUE))
#[1] "Some string that explains x."              
#[2] "Conflict" 
#[3] "Some string that explains y. Additional information."
#[4] "Precaution"                                          
#[5] "Some string that explains z. Justification." 
#[6] "Conflict"                          
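
Note that this pattern splits before every occurrence of a listed word, not only the last one. If the words only matter at the end of a string (an assumption about the data), the lookahead could be anchored with $. A sketch; pat_end and the sample string x are made up for illustration:

```r
pattern <- c("Conflict", "Precaution")
alt <- paste(pattern, collapse = "|")

pat     <- paste0("\\s+(?=\\b(", alt, ")\\b)")  # split before any listed word
pat_end <- paste0("\\s+(?=(", alt, ")$)")       # split only before a final listed word

x <- "A Conflict was noted. Precaution"

anywhere <- unlist(strsplit(x, pat, perl = TRUE))
# "A" "Conflict was noted." "Precaution"

at_end <- unlist(strsplit(x, pat_end, perl = TRUE))
# "A Conflict was noted." "Precaution"
```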
akrun
1

Another would be splitting at the last . (note that the period itself is consumed by the split):

unlist(strsplit(test, "\\.\\s*(?=[^\\.]+$)", perl=TRUE))

# [1] "Some string that explains x"                         "Conflict" 
# [3] "Some string that explains y. Additional information" "Precaution"
# [5] "Some string that explains z. Justification"          "Conflict" 
M--
0

In light of the fact that you may have cases you don't want to catch, here's what I would suggest:

test <- c("Some string that explains x. Conflict",
          "Some string that explains y. Additional information. Precaution",
          "Some string that explains z. Justification.   Conflict",
          "A String You Don't Want Conflict",
          "Another string you don't want that ends with a single word.  Word" )

pattern <- c("Conflict", "Precaution") # Plus the other ~8 words you want
# Punctuation that ends a sentence, one or more spaces, one of the words you want, then the end of the string
pattern.regex <- paste0("(\\.|\\?|!)\\s+(", paste(pattern, collapse = "|"), ")$")

# A version of test without irrelevant values
test2 <- test[grep(pattern.regex, test, perl = TRUE)]

And then you can just split each string in test2 as in akrun's answer (without needing to specify specific words, since we've already restricted test2 to only contain cases ending in one of your desired words).

unlist(strsplit(test2, "(?<=\\.) +(?=\\S+$)", perl = TRUE))
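
One caveat: the filter accepts ., ? and ! before the final word, while the split regex above only looks behind for a period. A sketch that widens the lookbehind to match the filter and also keeps the rejected strings for manual review. The sixth test string and the names keep, skipped and pieces are hypothetical additions:

```r
test <- c("Some string that explains x. Conflict",
          "Some string that explains y. Additional information. Precaution",
          "Some string that explains z. Justification.   Conflict",
          "A String You Don't Want Conflict",
          "Another string you don't want that ends with a single word.  Word",
          "Does this explain w? Precaution")  # hypothetical extra case

pattern <- c("Conflict", "Precaution")
pattern.regex <- paste0("(\\.|\\?|!)\\s+(", paste(pattern, collapse = "|"), ")$")

keep <- grepl(pattern.regex, test, perl = TRUE)  # TRUE where the filter matches
skipped <- test[!keep]                           # set aside for inspection

# Lookbehind accepts the same three punctuation marks as the filter
pieces <- unlist(strsplit(test[keep], "(?<=[.?!])\\s+(?=\\S+$)", perl = TRUE))
pieces
```

Looking through skipped on the real data is a quick way to judge whether the filter is too narrow.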

That said, there are more things you may want to consider, such as

  • Can words like 'Conflict' have a period after?
  • Do they have to begin with uppercase letters, or can they be all lowercase/uppercase?
  • Do you want cases like the fourth element of test, where there's no period at the end of the segment before the final word?
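
If the answer to the first two questions turns out to be yes, the split regex could be loosened accordingly. A rough sketch, assuming case should be ignored and a trailing period after the final word is allowed (loose and the sample vector x are made up for illustration):

```r
pattern <- c("Conflict", "Precaution")
alt <- paste(pattern, collapse = "|")

# (?i) makes the match case-insensitive; \\.? allows one optional period
# after the final word; $ anchors the lookahead to the end of the string
loose <- paste0("(?i)\\s+(?=(", alt, ")\\.?$)")

x <- c("Explains x. conflict",
       "Explains y. PRECAUTION.",
       "Explains z. Conflict")

out <- unlist(strsplit(x, loose, perl = TRUE))
# "Explains x." "conflict" "Explains y." "PRECAUTION." "Explains z." "Conflict"
```

This variant drops the requirement for sentence-ending punctuation before the final word, so it would also split strings like the fourth element of test.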

Ultimately, my advice would be to try out the above and do a little digging into your dataset to see if the results are too broad or too narrow. But this at least gets across the basic idea, and provides for some level of uncertainty with regard to how your raw data looks.

Hayden Y.