Suppose I have the following text:

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")

I want to capitalize the first alphabetical character of each sentence.

I figured out the regular expression to match as: ^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]

A call to gregexpr returns:

> gregexpr("^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]", txt)
[[1]]
[1]   1  16  65  75 104 156
attr(,"match.length")
[1] 0 7 7 8 7 8
attr(,"useBytes")
[1] TRUE

These are the correct substring indices for the matches.

However, how do I use them to capitalize the characters I need? I'm assuming I have to strsplit and then... ?

Ray
  • I don't know anything about R, sorry, but you'd usually take the first character, capitalize it, and then concatenate it with [1:] (the substring containing the rest of the string). – UIlrvnd Apr 10 '14 at 00:33
  • The first related question gives you the R-specific info. – UIlrvnd Apr 10 '14 at 00:45
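
A minimal base R sketch of the idea in the comment above: find each sentence-initial letter, uppercase it, and leave the rest of the string untouched. It uses gregexpr together with `regmatches<-` rather than strsplit. The pattern and the short sample string are simplified assumptions for illustration, not the regex from the question, and this pattern would also capitalize the word after an abbreviation such as "O.K.".

```r
# Hypothetical short sample, not the full text from the question.
s <- "this is a test! who knows. sad!"

# Simplified, assumed pattern: a lowercase letter at the start of the string,
# or one preceded by sentence-ending punctuation and whitespace.
m <- gregexpr("(^|[.!?][[:space:]]+)[[:lower:]]", s)

# regmatches<- writes replacements back at the matched positions;
# toupper() leaves the punctuation and spaces inside each match unchanged.
regmatches(s, m) <- lapply(regmatches(s, m), toupper)
print(s)
```

On this sample the result should be "This is a test! Who knows. Sad!". Because the whole match (punctuation, whitespace, and letter) is passed through toupper, no manual substring arithmetic on the match positions is needed.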

2 Answers

It appears that your regex did not work for your example, so I stole one from this question.

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")
print(txt)

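# \\U ... \\E uppercases capture group 1 (the first character of each sentence);
# these case-conversion escapes are only available with perl = TRUE.
gsub("([^.!?\\s])([^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?)(?=\\s|$)", "\\U\\1\\E\\2", txt, perl = TRUE, useBytes = FALSE)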
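
The "\\U\\1\\E\\2" replacement is what does the capitalizing in the call above. A smaller sketch of just that mechanism, on a made-up string (the names x and y and the simple pattern are illustrative assumptions, not part of the answer's regex):

```r
# Case-conversion escapes in the replacement are only honoured with perl = TRUE.
x <- "a test of capitalizing"

# \\U uppercases what follows; \\E ends the conversion, so group 2 is left as-is.
y <- gsub("^(\\w)(.*)$", "\\U\\1\\E\\2", x, perl = TRUE)
print(y)
```

Here only the first character is wrapped in \\U ... \\E, so the rest of the string keeps its original case.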
merlin2011

Using rex may make this type of task a little simpler. This implements the same regex that merlin2011 used.

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")

library(rex)  # provides rex(), re_substitutes(), and the shortcuts used below

re <- rex(
  capture(name = 'first_letter', alnum),
  capture(name = 'sentence',
    any_non_puncts,
    zero_or_more(
      group(
        punct %if_next_isnt% space,
        any_non_puncts
        )
      ),
    maybe(punct)
    )
  )

re_substitutes(txt, re, "\\U\\1\\E\\2", global = TRUE)
#>[1] "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me..  There are certain cases that I may not figure out??  Sad!  ^_^"
Jim