Using strsplit when required split character vector is not consistent for all observations in variable (R)

Question

I have data that looks like the following:

   duration                       obs   another
 1 1.801760     ID: 10 DAY: 6/10/13 S    orange
 2 1.868500     ID: 10 DAY: 6/10/13 S     green
 3 0.233562     ID: 10 DAY: 6/10/13 S    yellow
 4 5.538760       ID:96 DAY: 6/8/13 T    yellow
 5 3.436700       ID:96 DAY: 6/8/13 T      blue
 6 0.533856       ID:96 DAY: 6/8/13 T      pink
 7 2.302250       ID:96 DAY: 6/8/13 T    orange
 8 2.779420       ID:96 DAY: 6/8/13 T     green

I have included just 3 variables, though in reality my data has many. My issue is to do with the ugly looking "obs" variable. I received these data from another individual who entered this information inconsistently into the software that they were using.

'obs' contains three pieces of information: - id (ID: 10, ID:96, etc.) - date (M/D/Y) - identifier (S or T)

I want to split this information and extract the ID number (10 or 96), the date (e.g. 6/8/13) and the identifier (S or T).

To do this I tried the following using strsplit:

temp<-strsplit(as.character(df$obs), " ")
mat<-matrix(unlist(temp), ncol=5, byrow=TRUE)

I thought this would work as in my real data, I have >130,000 observations and I did not realize that some observations had the issue whereby the id did not have a whitespace " " between the "ID:" and the number. In the data above for instance, "ID:96" does not have white space between the colon and the number. Obviously, I got this warning message:

Warning message:
  In matrix(unlist(temp), ncol = 5, byrow = TRUE) :
  data length [796454] is not a sub-multiple or multiple of the number of rows [159291]

Clearly, the strsplit cannot be coerced into nice regular columns as the output of the strsplit takes two forms:

[1] "ID:"     "10"      "DAY:"    "6/10/13" "S"   #when there is whitespace
[1] "ID:96"  "DAY:"   "6/8/13" "T"   #when there isn't whitespace

To try and get round this I did this, thinking that if I could introduce any space after the 'ID:' it could work:

df$obs <- gsub("ID:", "ID: ", df$obs)

But this didn't work as when I then did the strsplit, it would recognize the double whitespace as two places to split the data.

If anybody knows of a solution for multiple strsplits that can then be coerced back into the original df with separate columns for the idnumber, date, identifier, that would be great.

edit: sorry, forgot to add the data for a reproducible example:

df<-structure(list(duration = c(1.80176, 1.8685, 0.233562, 5.53876, 
                        3.4367, 0.533856, 2.30225, 2.77942), obs = structure(c(1L, 1L, 
                                                                               1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ID: 10 DAY: 6/10/13 S", 
                                                                                                                   "ID:96 DAY: 6/8/13 T"), class = "factor"), another = structure(c(3L, 
                                                                                                                                                                                    2L, 5L, 5L, 1L, 4L, 3L, 2L), .Label = c("blue", "green", "orange", 
                                                                                                                                                                                                                            "pink", "yellow"), class = "factor")), .Names = c("duration", 
                                                                                                                                                                                                                                                                              "obs", "another"), class = "data.frame", row.names = c(NA, -8L
                                                                                                                                                                                                                                                                              ))

MrFlick · Accepted Answer · 2014-07-03T15:42:38.637

After you fire that data entry person, I might consider a regular expression here to capture the data. First, here's just the data in the "obs" column (adding the additional value form your comment)

obs<-c("ID: 10 DAY: 6/10/13 S", "ID: 10 DAY: 6/10/13 S", "ID: 10 DAY: 6/10/13 S", 
"ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", 
"ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", "ID: 84DAY: 6/8/13 T")

Next, I can capture the data with

m<-regexpr("ID:\\s*(\\d+) ?DAY: (\\d+/\\d+/\\d+) (S|T)", obs, perl=T)

Next, I use a helper function regcapturedmatches() to extract the captured matches (it works like regmatches() but for capture groups)

do.call(rbind, regcapturedmatches(obs,m))

#      [,1] [,2]      [,3]
# [1,] "10" "6/10/13" "S" 
# [2,] "10" "6/10/13" "S" 
# [3,] "10" "6/10/13" "S" 
# [4,] "96" "6/8/13"  "T" 
# [5,] "96" "6/8/13"  "T" 
# [6,] "96" "6/8/13"  "T" 
# [7,] "96" "6/8/13"  "T" 
# [8,] "96" "6/8/13"  "T" 
# [9,] "84" "6/8/13"  "T"

This returns a matrix of values. You can then process those character values how ever you like. You can convert them to the correct class and attach to your data.frame.

But if you did want to use a strsplit, you could split on either a ":" or spaces with an options preceding ":"

do.call(rbind, strsplit(obs,"(:|:?\\s+)", obs))

#      [,1] [,2]    [,3]     [,4]      [,5]
# [1,] "ID" "10"    "DAY"    "6/10/13" "S" 
# [2,] "ID" "10"    "DAY"    "6/10/13" "S" 
# [3,] "ID" "10"    "DAY"    "6/10/13" "S" 
# [4,] "ID" "96"    "DAY"    "6/8/13"  "T" 
# [5,] "ID" "96"    "DAY"    "6/8/13"  "T" 
# [6,] "ID" "96"    "DAY"    "6/8/13"  "T" 
# [7,] "ID" "96"    "DAY"    "6/8/13"  "T" 
# [8,] "ID" "96"    "DAY"    "6/8/13"  "T" 
# [9,] "ID" "84DAY" "6/8/13" "T"       "ID"

which works up till your latest new line of bad data.

Thanks. A quick clarification question. Some of my id's are three digits. e.g. I have "ID:113" and "ID: 120". Looking at the regexpr code, do I need to alter that to address that? — jalapic, Jul 03 '14 at 14:36
@jalapic Nope `\\d+` means one or more digits so three digit IDs are just fine. — MrFlick, Jul 03 '14 at 14:37
this solution is really good and it works well. However, in my real data it only works until line 99101 when I encounter another data entry error: The 'obs' variable for the next several lines looks like this: "ID: 84DAY: 6/8/13 T" the next unique id thereafter is "ID: 96 DAY: 6/10/13 S" about 400 lines later. The regcapturematches function returns "96" "6/10/13" "S" for row 99102 (i.e. the first one where id=84 starts). Is there a way to fix this with the code? or should i manually edit the '84DAY' error using gsub? (The data entry person no longer works for me!) — jalapic, Jul 03 '14 at 15:26
I don't think we can fix the `strsplit` method for data like that, but you could change the regular expression to `m<-regexpr("ID:\\s*(\\d+)\\s?DAY: (\\d+/\\d+/\\d+) (S|T)", obs, perl=T)`. This now says there will be an optional space before "DAY" instead of a required one. — MrFlick, Jul 03 '14 at 15:37

score 3 · Answer 2 · answered Jul 03 '14 at 15:39

You may also use:

  read.table(text=gsub(":"," ", df$obs),header=F,stringsAsFactors=F)
  V1 V2  V3      V4 V5
# 1 ID 10 DAY 6/10/13  S
# 2 ID 10 DAY 6/10/13  S
# 3 ID 10 DAY 6/10/13  S
# 4 ID 96 DAY  6/8/13  T
# 5 ID 96 DAY  6/8/13  T
# 6 ID 96 DAY  6/8/13  T
# 7 ID 96 DAY  6/8/13  T
# 8 ID 96 DAY  6/8/13  T

Using strsplit when required split character vector is not consistent for all observations in variable (R)

2 Answers2