Capitalize the first word of a sentence (regex, gsub, gregexpr)

Question

Suppose I have the following text:

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")

I want to capitalize the first alphabetical character of a sentence.

I figured out the regular expression to match as: ^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]

A call to gregexpr returns:

> gregexpr("^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]", txt)
[[1]]
[1]   1  16  65  75 104 156
attr(,"match.length")
[1] 0 7 7 8 7 8
attr(,"useBytes")
[1] TRUE

These are the correct indices of the matching substrings.

However, how do I implement this to properly capitalize the characters I need? I'm assuming I have to strsplit and then... ?

i don't know anything about r, sorry, but you'd usually get the first character, cap that and then concat to the [1:] (substring containing the rest of the string).. — UIlrvnd, Apr 10 '14 at 00:33
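In R, the idea from the comment above can be combined with the match positions that gregexpr returns. The following is only a sketch: the shortened sample string `x` and the simplified pattern are assumptions for illustration, not the regex from the question. The replacement form `regmatches<-` writes the upper-cased matches back into the string.

```r
# Shortened sample text (an assumption, not the question's txt).
x <- "this is a test! who knows. sad!"

# Match the first letter at the start of the string, or the first letter
# that follows sentence-ending punctuation and whitespace.
m <- gregexpr("(^|[.!?][[:space:]]+)[[:alpha:]]", x)

# Upper-case every matched substring and write it back in place.
# toupper() leaves punctuation and spaces in the match unchanged.
regmatches(x, m) <- lapply(regmatches(x, m), toupper)
x
#> [1] "This is a test! Who knows. Sad!"
```

This simplified pattern does not try to handle abbreviations such as "O.K." or repeated punctuation followed directly by text.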

score 4 · Accepted Answer · edited May 23 '17 at 10:33

It appears that your regex did not work for your example, so I stole one from this question.

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")
print(txt)

gsub("([^.!?\\s])([^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?)(?=\\s|$)", "\\U\\1\\E\\2", txt, perl = TRUE, useBytes = FALSE)
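The replacement string "\\U\\1\\E\\2" relies on the case-conversion escapes that gsub accepts only when perl = TRUE: \\U upper-cases what follows in the replacement, and \\E ends the conversion. A minimal sketch on deliberately simple patterns (the sample strings and the variable names `words` and `sentences` are assumptions for illustration):

```r
# \\U upper-cases capture group 1; \\E stops the conversion so that
# group 2 is copied through unchanged.
words <- gsub("(\\w)(\\w*)", "\\U\\1\\E\\2", "hello world", perl = TRUE)
words
#> [1] "Hello World"

# Group 1 holds the sentence boundary (start of string, or punctuation
# plus whitespace); only the letter in group 2 is upper-cased.
sentences <- gsub("(^|[.!?]\\s+)([a-z])", "\\1\\U\\2",
                  "hello world. goodbye world! ok?", perl = TRUE)
sentences
#> [1] "Hello world. Goodbye world! Ok?"
```

Without perl = TRUE, these escapes are not interpreted as case conversions.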

score 1 · Answer 2 · answered Nov 27 '14 at 18:10

Using rex may make this type of task a little simpler. This implements the same regex that merlin2011 used.

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")

library(rex)

re <- rex(
  capture(name = 'first_letter', alnum),
  capture(name = 'sentence',
    any_non_puncts,
    zero_or_more(
      group(
        punct %if_next_isnt% space,
        any_non_puncts
        )
      ),
    maybe(punct)
    )
  )

re_substitutes(txt, re, "\\U\\1\\E\\2", global = TRUE)
#> [1] "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me..  There are certain cases that I may not figure out??  Sad!  ^_^"

This is really helpful!! I had no idea this existed. — Ray, Dec 01 '14 at 19:59
