I'm new to R and practicing using the Titanic data set from Kaggle. I am attempting to separate last name, first name, salutation, and extra information into separate columns so that I can try to categorize the age of the passengers - adult or child.

The following is sample data from the Train data set:

head(traindf,5)
# Source: local data frame [5 x 12]
# 
# PassengerId Survived Pclass
# 1           1        0      3
# 2           2        1      1
# 3           3        1      3
# 4           4        1      1
# 5           5        0      3
# Variables not shown: Name (chr), Sex (fctr), Age (dbl), SibSp (int), Parch
# (int), Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)

The following is a sample that includes the Name:

select(traindf,Survived,Pclass,Name,Sex)
# Source: local data frame [891 x 4]
# 
# Survived Pclass                                                Name    Sex
# 1         0      3                             Braund, Mr. Owen Harris   male
# 2         1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
# 3         1      3                              Heikkinen, Miss. Laina female
# 4         1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female
# 5         0      3                            Allen, Mr. William Henry   male
# 6         0      3                                    Moran, Mr. James   male
# 7         0      1                             McCarthy, Mr. Timothy J   male
# 8         0      3                      Palsson, Master. Gosta Leonard   male
# 9         1      3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female
# 10        1      2                 Nasser, Mrs. Nicholas (Adele Achem) female

I can use the following code to separate last name from the rest of the column:

require(tidyr) # for the separate() function

traindfnames <- traindf %>%
  separate(Name, c("Lastname","Salutation"), sep = ",")

traindfnames 
# Source: local data frame [891 x 13]
# 
# PassengerId Survived Pclass  Lastname
# 1            1        0      3    Braund
# 2            2        1      1   Cumings
# 3            3        1      3 Heikkinen
# 4            4        1      1  Futrelle
# 5            5        0      3     Allen
# 6            6        0      3     Moran
# 7            7        0      1  McCarthy
# 8            8        0      3   Palsson
# 9            9        1      3   Johnson
# 10          10        1      2    Nasser
# ..         ...      ...    ...       ...
# Variables not shown: Salutation (chr), Sex (fctr), Age (dbl), SibSp (int),
# Parch (int), Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)

However, when I try to add a field for First Name:

traindfnames <- traindf %>%
separate(Name, c("Lastname","Salutation","firstname"), sep =",,")

I get this error:

# Error: Values not split into 3 pieces at 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 2

Am I using incorrect syntax, or is it not possible to split one column into three fields?

cchamberlain
SandraK

1 Answer

Having looked at this data, I think the easiest way to do it is with str_match() from the stringr package. Assuming data$Name has the form "[Lastname], [Salutation]. [Firstname]", a regular expression that matches it is:

str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")
#      [,1]                                                  [,2]        [,3]   [,4]                                   
# [1,] "Braund, Mr. Owen Harris"                             "Braund"    "Mr"   "Owen Harris"                          
# [2,] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Cumings"   "Mrs"  "John Bradley (Florence Briggs Thayer)"
# [3,] "Heikkinen, Miss. Laina"                              "Heikkinen" "Miss" "Laina"                                
# [4,] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"        "Futrelle"  "Mrs"  "Jacques Heath (Lily May Peel)"        
# [5,] "Allen, Mr. William Henry"                            "Allen"     "Mr"   "William Henry"                        
# [6,] "Moran, Mr. James"                                    "Moran"     "Mr"   "James" 

So you need to add columns 2 to 4 of this matrix to your original data frame. I am not sure you can do this with separate(). Writing

separate(data, Name, c("Lastname", "Salutation", "Firstname"), sep = "[,\\.]") 

will try to split each entry at every comma or dot, but it runs into a problem at the 514th entry, "Rothschild, Mrs. Martin (Elizabeth L. Barrett)": the second dot produces a fourth piece.
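
On the original syntax question: sep is interpreted as a regular expression, so sep = ",," only matches two consecutive commas, which never occur in Name, and every value stays in one piece. As a minimal sketch, assuming a tidyr version in which separate() has the extra argument, extra = "merge" caps the number of pieces so that a later dot is left alone. The two-row data frame df is a stand-in built from names quoted in this thread.

```r
library(tidyr)

# Stand-in data frame (assumption): two names quoted in this thread
df <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Rothschild, Mrs. Martin (Elizabeth L. Barrett)"),
  stringsAsFactors = FALSE
)

# sep is a regular expression: split on ", " or ". ".
# extra = "merge" stops after three pieces, so any later dot
# stays inside Firstname instead of creating a fourth piece.
out <- separate(df, Name, c("Lastname", "Salutation", "Firstname"),
                sep = ", |\\. ", extra = "merge")
out
```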

In short, the easiest way I can see of doing what you want is

library(stringr)
# Columns 2:4 of the match matrix are last name, salutation, first name
data[c("Lastname", "Salutation", "Firstname")] <-
    str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")[, 2:4]
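
One caveat about the pattern: [A-Za-z]* only accepts plain letters, so a last name containing an apostrophe is silently truncated, and a salutation of more than one word is not matched at all (str_match() then returns NA). A hedged sketch of a more permissive, anchored pattern follows. The pattern and the sample names in names_vec are illustrative assumptions, not part of the answer above.

```r
library(stringr)

# Illustrative values (assumption): an apostrophe in the last name
# and a two-word salutation
names_vec <- c("Braund, Mr. Owen Harris",
               "O'Dwyer, Miss. Ellen",
               "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# Pattern from the answer: letters only, not anchored
strict <- str_match(names_vec, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")

# Permissive variant: everything up to the first comma is the last name,
# everything up to the next dot is the salutation, the rest is the first name
loose <- str_match(names_vec, "^([^,]+),\\s*([^.]+)\\.\\s*(.*)$")

# Rows where the strict pattern found no match at all
which(is.na(strict[, 1]))

loose[, 2:4]
```

Counting the NA rows after any such match is a cheap way to see which names the pattern still misses.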
konvas
  • I'm just a little obsessed with dplyr. It is so easy for a non-programmer to understand that I was hoping to find a dplyr answer. I will go with stringr. Thanks for looking at it. – SandraK Oct 07 '14 at 14:48
  • See also `tidyr::extract` which provides a convenient wrapper for `str_match` – hadley Oct 08 '14 at 23:49
  • @SandraK I'd strongly encourage you to learn a little bit about regular expressions - at first they look like the cat has walked over the keyboard, but they're very powerful once you master them. – hadley Oct 08 '14 at 23:50
  • @Hadley, tidyr::extract is working for this task. Thanks for advice about regular expressions. I'll keep working on it. – SandraK Oct 10 '14 at 16:18
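
The tidyr::extract suggestion from the comments could be sketched roughly as follows, reusing the regular expression from the answer. The small data frame df is a stand-in for traindf, and the call is qualified as tidyr::extract() in case another attached package (magrittr, for example) also exports an extract().

```r
library(tidyr)

# Stand-in for traindf (assumption): two rows quoted in the question
df <- data.frame(
  PassengerId = 1:2,
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"),
  stringsAsFactors = FALSE
)

# Each capture group in the regex becomes one new column;
# by default the original Name column is removed
out <- tidyr::extract(df, Name,
                      into = c("Lastname", "Salutation", "Firstname"),
                      regex = "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")
out
```

Because the first argument is the data frame, the same call also fits in a %>% pipeline after traindf.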