I've got a vector of (human) names, all in capitals:

names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

To decapitalize them (that is, keep only the first letter of each word capitalized), I have been using

simpleDecap <- function(x) {
  s <- strsplit(x, " ")[[1]] 
  paste0(substring(s, 1,1), tolower(substring(s, 2)), collapse=" ")
  }
sapply(names, simpleDecap, USE.NAMES=FALSE)
# [1] "Friedrich Schiller"         "Frank O'hara"         "Hans-christian Andersen"

But I also want to account for ' and -. Using s <- strsplit(x, " |\\'|\\-")[[1]] of course finds the right letters, but then ' and - get lost in the collapse, because paste0() can only rejoin the pieces with a single separator. Hence, I tried

simpleDecap2 <- function(x) {
  # Split, decapitalize and rejoin once per delimiter, one delimiter after the other
  for (char in c(" ", "\\-", "\\'")){
    s <- strsplit(x, char)[[1]] 
    x <- paste0(substring(s, 1,1), tolower(substring(s, 2)), collapse=char)
  }
  return(x)
}

but that's even worse, of course, as the results are split one after the other:

sapply(names, simpleDecap2, USE.NAMES=FALSE)
# [1] "Friedrich schiller"      "Frank o'Hara"            "Hans-christian andersen"

I think the right approach is to split with s <- strsplit(x, " |\\'|\\-")[[1]], but the paste0() step is the problem, because it does not put the original delimiters back.

MERose

2 Answers

This seems to work, using Perl compatible regular expressions:

gsub("\\b(\\w)([\\w]+)", "\\1\\L\\2", names, perl = TRUE)

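\b anchors the match at the start of each word, so matching restarts after a space, ' or -. The first group captures the initial letter and is kept unchanged; \L converts the rest of the replacement, here the second match group, to lower case.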
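
As a quick check, here is the same call wrapped in a small helper and applied to the names from the question. The function name decap is only made up for illustration, and I dropped the redundant brackets around \\w; treat it as a sketch rather than a general-purpose name formatter.

```r
names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

# decap is a hypothetical helper name, not part of base R.
# \\b    : word boundary, so a new match can start after " ", "'" or "-"
# (\\w)  : first letter of the word, re-inserted unchanged as \\1
# (\\w+) : remaining letters of the word, lower-cased by \\L\\2
decap <- function(x) gsub("\\b(\\w)(\\w+)", "\\1\\L\\2", x, perl = TRUE)

decap(names)
# returns "Friedrich Schiller", "Frank O'Hara", "Hans-Christian Andersen"
```

Note that the first letter is copied as is, so this relies on the input already being in capitals. For mixed-case input the replacement could be changed to "\\U\\1\\L\\2".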

Konrad Rudolph
  • love the conciseness *and* incomprehensibility of perl-regexp :-) – Carl Witthoft Sep 24 '15 at 13:04
  • @CarlWitthoft I really disagree with this. Like any other language, they have to be learned, but for what they express they are very simple and comprehensible. The general claim that regex are incomprehensible is a huge canard: equivalent, manual parsing code is almost always more complex and harder to understand. Using the word “incomprehensible” for this is really misleading. – Konrad Rudolph Sep 24 '15 at 14:04
  • It's a humorous dig at the meme that perl itself allows programming style that makes the obfuscated c-code contest pale by comparison. Of course regex is *the* way to go for string manipulation. – Carl Witthoft Sep 24 '15 at 15:37
  • Going off-course from the posted question - suppose there's an input like `FRED BOUchARD` , or worse, `IAN MACDONALD` (Ian MacDonald). – Carl Witthoft Sep 24 '15 at 16:08
  • @CarlWitthoft Your “Fred Bouchard” case works (or is this supposed to do something odd to the casing?). Stuff like “MacDonald” is hard (impossible?) to get right without hard-coding such prefixes. – Konrad Rudolph Sep 24 '15 at 16:15

Although I agree that the Perl regexp is the better solution, the simpleDecap2 approach is not far from working.

simpleDecap3 <- function(x) {
    x <- tolower(x)
    for (char in c(" ", "-", "'")){
        s <- strsplit(x, char)[[1]] 
        x <-paste0(toupper(substring(s, 1,1)), substring(s, 2), collapse=char)
    } 
    x
}

That is, turn the whole name to lower case and then capitalize the first letter after " ", "-", or "'". Not as nice-looking as the regexp and most likely not as robust, but it gets the job done with just minor changes from your original code.
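
The same idea (lower-case everything, then upper-case the letter after each delimiter) can also be sketched without the loop. This is only an illustration under my own assumptions: the name decapVec is made up, and it uses gregexpr() with regmatches<- from base R instead of strsplit().

```r
# Hypothetical vectorized variant of the idea above (not the original function).
decapVec <- function(x) {
    x <- tolower(x)
    # Find a lower-case ASCII letter at the start of the string
    # or directly after " ", "'" or "-" (the delimiter is part of the match).
    m <- gregexpr("(^|[ '-])[a-z]", x, perl = TRUE)
    # Upper-case every match; toupper() leaves the delimiter itself unchanged.
    regmatches(x, m) <- lapply(regmatches(x, m), toupper)
    x
}

decapVec(c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN"))
# returns "Friedrich Schiller", "Frank O'Hara", "Hans-Christian Andersen"
```

Because the pattern uses [a-z], a word starting with a non-ASCII letter would stay in lower case here, so this is probably no more robust than the loop.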

DGKarlsson