I have data that looks like the following:
duration obs another
1 1.801760 ID: 10 DAY: 6/10/13 S orange
2 1.868500 ID: 10 DAY: 6/10/13 S green
3 0.233562 ID: 10 DAY: 6/10/13 S yellow
4 5.538760 ID:96 DAY: 6/8/13 T yellow
5 3.436700 ID:96 DAY: 6/8/13 T blue
6 0.533856 ID:96 DAY: 6/8/13 T pink
7 2.302250 ID:96 DAY: 6/8/13 T orange
8 2.779420 ID:96 DAY: 6/8/13 T green
I have included just 3 variables, though in reality my data has many. My issue is to do with the ugly looking "obs" variable. I received these data from another individual who entered this information inconsistently into the software that they were using.
'obs' contains three pieces of information: - id (ID: 10, ID:96, etc.) - date (M/D/Y) - identifier (S or T)
I want to split this information and extract the ID number (10 or 96), the date (e.g. 6/8/13) and the identifier (S or T).
To do this I tried the following using strsplit:
temp<-strsplit(as.character(df$obs), " ")
mat<-matrix(unlist(temp), ncol=5, byrow=TRUE)
I thought this would work as in my real data, I have >130,000 observations and I did not realize that some observations had the issue whereby the id did not have a whitespace " " between the "ID:" and the number. In the data above for instance, "ID:96" does not have white space between the colon and the number. Obviously, I got this warning message:
Warning message:
In matrix(unlist(temp), ncol = 5, byrow = TRUE) :
data length [796454] is not a sub-multiple or multiple of the number of rows [159291]
Clearly, the strsplit cannot be coerced into nice regular columns as the output of the strsplit takes two forms:
[1] "ID:" "10" "DAY:" "6/10/13" "S" #when there is whitespace
[1] "ID:96" "DAY:" "6/8/13" "T" #when there isn't whitespace
To try and get round this I did this, thinking that if I could introduce any space after the 'ID:' it could work:
df$obs <- gsub("ID:", "ID: ", df$obs)
But this didn't work as when I then did the strsplit, it would recognize the double whitespace as two places to split the data.
If anybody knows of a solution for multiple strsplits that can then be coerced back into the original df with separate columns for the idnumber, date, identifier, that would be great.
edit: sorry, forgot to add the data for a reproducible example:
df<-structure(list(duration = c(1.80176, 1.8685, 0.233562, 5.53876,
3.4367, 0.533856, 2.30225, 2.77942), obs = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ID: 10 DAY: 6/10/13 S",
"ID:96 DAY: 6/8/13 T"), class = "factor"), another = structure(c(3L,
2L, 5L, 5L, 1L, 4L, 3L, 2L), .Label = c("blue", "green", "orange",
"pink", "yellow"), class = "factor")), .Names = c("duration",
"obs", "another"), class = "data.frame", row.names = c(NA, -8L
))