Partial regex results in R

Question

I would like to understand what goes wrong when I use str_replace_all to transform this string:

abc <- "Good Product ...but it's darken the skin tone..why...?"

I would like to convert it to something like the string below before running sentence tokenization with quanteda:

abc_new <- "Good Product. But it's darken the skin tone. Why?"

I am using the following regex to enable this:

str_replace_all(abc,"\\.{2,15}[a-z]{1}", paste(".", toupper(str_extract_all(str_extract_all(abc,"\\.{2,15}[a-z]{1}"),"[a-z]{1}")[[1]])[[1]], collapse = " "))

However, this returns: "Good Product. Cut it's darken the skin tone. Chy...?"

Can someone suggest a solution for this?

You have really complex replacement. What are you trying to do? Replace all multiple dots with consecutive letter to one dot and capital letter? — m0nhawk, Jan 16 '18 at 11:58
I don't know what you are trying to do, but "\\.{2,15}[a-z]{1}" does the following: match 2-15 literal `.`s, followed by a single lowercase letter (or more, because the regex then ends and doesn't care about what comes after. To make sure it's *just 1*, add a `$` (line ending) or `[^a-z]` at the end.) — PixelMaster, Jan 16 '18 at 12:14
Yes m0nhawk, the objective is to find out instances of multiple dots followed by a letter and replace it with a single dot and the upper case of that letter. I think once this is done, only then quanteda would be able to recognize these as separate sentences when asked for tokenization. — AJosh, Jan 16 '18 at 12:35

score 1 · Answer 1 · answered Jan 16 '18 at 13:47

It's really, really difficult to read and understand the replacement code you provided, given how long and nested it is.

I would break the complex pattern down into smaller, traceable steps that are easy to debug. You can do that either by assigning the intermediate results to temporary variables or by using the pipe operator:

library(magrittr)
string <- "Good Product ...but it's darken the skin tone..why...?"
string %>% 
  gsub("\\.+\\?", "?", .) %>%   # Remove full-stops before question marks
  gsub("\\.+", ".", .) %>%      # Replace all multiple dots with a single one
  gsub(" \\.", ".", .) %>%      # Remove space before dots
  gsub("(\\.)([^ ])", ". \\2", .) %>%  # Add a space between the full stop and the next sentence
  gsub("(\\.) ([[:alpha:]])", ". \\U\\2", ., perl=TRUE) # Replace the first letter after the full stop with its uppercase form

  # [1] "Good Product. But it's darken the skin tone. Why?"
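The temporary-variable variant mentioned above can be sketched in base R as follows. The names s1 to s5 are illustrative and not part of the original answer; the patterns are the same five used in the pipe version, so each intermediate string can be inspected on its own.

```r
string <- "Good Product ...but it's darken the skin tone..why...?"

# Each step stores its result so it can be printed and checked separately.
s1 <- gsub("\\.+\\?", "?", string)      # drop dots directly before a question mark
s2 <- gsub("\\.+", ".", s1)             # collapse runs of dots into a single dot
s3 <- gsub(" \\.", ".", s2)             # remove the space before a dot
s4 <- gsub("(\\.)([^ ])", ". \\2", s3)  # make sure a space follows each dot
s5 <- gsub("(\\.) ([[:alpha:]])", ". \\U\\2", s4, perl = TRUE)  # capitalise the next letter

s2
#> [1] "Good Product .but it's darken the skin tone.why?"
s4
#> [1] "Good Product. but it's darken the skin tone. why?"
s5
#> [1] "Good Product. But it's darken the skin tone. Why?"
```

If the final string is wrong, printing s1 to s4 in turn shows which substitution introduced the problem.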

Thank you Deena and Mark. Makes sense to process them in smaller steps for ease of understanding and debugging — AJosh, Jan 22 '18 at 16:24

score 1 · Answer 2 · answered Jan 16 '18 at 14:32

It seems like you're trying to match a pattern to remove, while reusing part of that match in the replacement. In regular expressions you can wrap a portion of the pattern in parentheses () to capture it and refer to it in the replacement.

Consider in your case:

abc <- "Good Product ...but it's darken the skin tone..why...?"
step1 <- gsub(" ?\\.+([a-zA-Z])",". \\U\\1",abc,perl=TRUE)
step1
#> [1] "Good Product. But it's darken the skin tone. Why...?"

The matching expression breaks down as:

 ?         #Optionally match a space (to handle the space after Good Product)
\\.+       #Match at least one period
([a-zA-Z]) #Match one letter and remember it

The replacement pattern breaks down as:

.          #Insert a period followed by a space
\\U        #Insert an uppercase version...
\\1        #...of whatever was matched in the first set of parentheses
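As a small illustration of the two replacement tokens, the sketch below applies a simplified pattern to a made-up input ("one..two" and the variable names are not from the answer). It contrasts a plain back-reference with the upper-casing form, which relies on perl = TRUE.

```r
x <- "one..two"  # illustrative input

# \\1 alone re-inserts the captured letter unchanged.
keep_case <- gsub("\\.+([a-z])", ". \\1", x)
keep_case
#> [1] "one. two"

# \\U before \\1 upper-cases the captured letter (perl = TRUE is needed for \\U).
upper_case <- gsub("\\.+([a-z])", ". \\U\\1", x, perl = TRUE)
upper_case
#> [1] "one. Two"
```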

Now, this doesn't fix the ellipses followed by a question mark. A followup match can fix this.

step2 <- gsub("\\.+([^\\. ])","\\1",step1)
step2
#> [1] "Good Product. But it's darken the skin tone. Why?"

Here we're matching

\\.+      #at least one period
([^\\. ]) #one character that is not a period or a space and remember it

Replacing with

\\1 #The thing we remembered

So, two steps, two fairly generic regular expressions that should extend to other use cases as well.
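To try the two steps on more than one string, they can be wrapped in a small function, since gsub is vectorised over its input. This is only a sketch: the name fix_ellipses and the second test string are made up for illustration and do not come from the answer.

```r
# Hypothetical helper combining the two substitutions from this answer.
fix_ellipses <- function(x) {
  # Step 1: dots followed by a letter become ". " plus the upper-cased letter.
  step1 <- gsub(" ?\\.+([a-zA-Z])", ". \\U\\1", x, perl = TRUE)
  # Step 2: remaining dots directly before a non-space, non-dot character are dropped.
  gsub("\\.+([^\\. ])", "\\1", step1)
}

reviews <- c("Good Product ...but it's darken the skin tone..why...?",
             "nice..really nice...!")
cleaned <- fix_ellipses(reviews)
cleaned
#> [1] "Good Product. But it's darken the skin tone. Why?"
#> [2] "nice. Really nice!"
```

One caveat when extending this: the second pattern also removes a single period that sits directly before a non-space character, so "3.14" would become "314". Requiring at least two dots (for example \\.{2,}) is one possible way to tighten it if the text contains decimals or abbreviations.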
