5

I have several years worth of data on individuals, but their names are formatted differently each year. Half of the names are already in "First Last" order but I can't figure out how to successfully edit the other half ("Last, First").

Here's a sample df:

name <- c("First1 Last1","Last2, First2", "Last3, First3", "First4 Last4", "First5 Last5")
salary <-c(51000, 72000,125000,67000,155000)
year <-c(2012,2014,2013,2013,2014)

df <- data.frame(name, salary, year, stringsAsFactors=FALSE)

Here are things I've tried: split up the text by comma:

df$name2 <- strsplit(df$name, ", ") #to split the character string by comma
df$name3 <-paste(df$name2, collapse=" ") #to collapse the newly created vectors back into a string
df$name4 <-paste(rev(df$name2)) #to try pasting each vector in reverse order
df$name5 <-paste(rev(df$name2)[2:1]) #trying again...

I've printed the correct names, but backwards, and printed them on the wrong rows, but despite all googling I can't get it to work correctly. What am I doing wrong?

SabDeM
  • 7,050
  • 2
  • 25
  • 38
jesstme
  • 604
  • 2
  • 10
  • 25

2 Answers2

7

You can use a regular expression:

df$name <- sub("(L[A-Za-z0-9]+).*\\s+(F[A-Za-z0-9]+).*","\\2 \\1",df$name)

# df
#           name salary year
# 1 First1 Last1  51000 2012
# 2 First2 Last2  72000 2014
# 3 First3 Last3 125000 2013
# 4 First4 Last4  67000 2013
# 5 First5 Last5 155000 2014

The code looks for a word beginning with an uppercase L, followed by some letters / digits, then by some symbols, a space, then a word beginnign with an uppercase F, some letters / digits and then some symbols.

It then reorders the two words by putting first the one beginning with an F (that is, (F[A-Za-z0-9]+)), then the one beginning with an L ( that is, (L[A-Za-z0-9]+)).

As you can see, the code removes the comma (it seems to be your desired output).

With the new info, use the code :

df$name <- sub('(.*)\\,\\s+(.*)','\\2 \\1', df$name)

# sub('(.*)\\,\\s+(.*)','\\2 \\1',name)
# [1] "John Smith"       "Marcus Green"     "Mario Sanchez"    "Jennifer Roberts" "Sammy Lee"

Here, we are looking for characters before a comma, followed by a space and then by other characters. We then reorder the first and the second group to have the desired output.

Note: I assumed that if there is no comma, then the names are already in the correct order (that seems to be the case in your comment).

etienne
  • 3,648
  • 4
  • 23
  • 37
  • I should've been more clear--my real dataframe has actual names, not the words First & Last. A better example would've been: name <-c("Smith, John", "Marcus Green", "Sanchez, Mario", "Roberts, Jennifer", "Sammy Lee") – jesstme Oct 25 '16 at 22:04
  • Very brilliant solution! – SabDeM Oct 25 '16 at 22:23
  • Thank you, this worked great! Are you (or anyone) able to point me to resources to better understand how you wrote the code? – jesstme Oct 25 '16 at 22:33
  • 1
    @Jesster : you're welcome ! You can use ?sub to display the help for the function and in the help there is a link pointing to the details of regular expressions in R. I think you can also access the help directly by typing ?strsplit but as I don't have my computer at the time I can't be sure. – etienne Oct 25 '16 at 22:37
  • 1
    @Jesster : you can use ?regex as shown [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html) – etienne Oct 25 '16 at 22:39
2

I think this is what you want. You were really close, you need both a rev and a paste(..., collapse = " "). I also trim whitespace, but that may not be necessary.

# look for commas to see which rows need fixing
needs_rearranging = grep(",", df$name)
df$name[needs_rearranging] = 
           # split on the comma space, then
    sapply(strsplit(df$name[needs_rearranging], split = ", "),
       function(x) {
           # remove whitespace, reverse the order, and 
           # paste them back together
           paste(rev(trimws(x)), collapse = " ")
       })

df
#           name salary year
# 1 First1 Last1  51000 2012
# 2 First2 Last2  72000 2014
# 3 First3 Last3 125000 2013
# 4 First4 Last4  67000 2013
# 5 First5 Last5 155000 2014
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294