R Splitting strings with pattern of irregular lengths

Question

I have a data.frame column with +3000 strings, which I would like to have separated, but they are irregular although with a pattern. Here are some examples, and what I would like them converted into.

00700/Z14P120:xhkg
03988/Z14C3.2:xhkg
6A/F15C0.905:xcme
ADS/X14P56:xeur
AX1/X14P375:xams
BIDU/28X14C250:xcbf
ES/F15C1960:xcme
FUR/M16P8:xams

00700 | P | 120
03988 | C | 3.2
6A | C | 0.905
ADS | P | 56
AX1 | P | 375
BIDU | C | 250
ES | C | 1960
FUR | P | 8

I think this covers all possible lengths and value types for each substring.

The first new column should overwrite the input column and the two other columns should overwrite existing column blanks in the same data.frame

Another complication is that there are data.frame rows which are already formatted correctly, there is however a column which identifies the rows which are not. Below is a piece of the table as .CSV output.

enter image description here

Final solution: It turned out to be more difficult than anticipated to replace the values in the existing columns, due to issues with NA's, classes and registration of row numbers. Thus I ended up creating temp columns and replacing the entire column, in this fairly ugly and inefficient way. The code provided by Ananda Mahto does however work brilliantly.

ETO <- as.array(data_results$InstrumentSymbolCode)
ETO <- do.call(rbind, 
        strsplit(gsub("(.*)/[A-Z0-9]+?([A-Z])([0-9\\.-]+)?:.*", 
                      "\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", ETO),
                "NONSENSESPLIT", fixed = TRUE))
ETO[data_results$ProductCategoryID!=9] <- ""

temp1 <- array(0,nrow(ETO))
temp2 <- array(0,nrow(ETO))
temp3 <- array(0,nrow(ETO))
for (i in 1:nrow(ETO)){
  if (data_results$ProductCategoryID[i]==9) {
    temp1[i] <- ETO[i,1]
    temp2[i] <- ETO[i,2]
    temp3[i] <- ETO[i,3]
  }  else {
    temp1[i] <- as.character(data_results$InstrumentSymbolCode[i])
    temp2[i] <- as.character(data_results$PutCall[i])
    temp3[i] <- data_results$Strike[i]
  }
}
data_results$InstrumentSymbolCode<-as.character(temp1)
data_results$PutCall <- temp2
data_results$Strike <- temp3

score 1 · Accepted Answer · answered Nov 10 '14 at 13:41

You can use some regex along with strsplit, perhaps something like this:

do.call(rbind, 
        strsplit(gsub("(.*)/[A-Z0-9]+?([A-Z])([0-9\\.-]+)?:.*", 
                      "\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", mydf$v1),
                 "NONSENSESPLIT", fixed = TRUE))
#      [,1]    [,2] [,3]   
# [1,] "00700" "P"  "120"  
# [2,] "03988" "C"  "3.2"  
# [3,] "6A"    "C"  "0.905"
# [4,] "ADS"   "P"  "56"   
# [5,] "AX1"   "P"  "375"  
# [6,] "BIDU"  "C"  "250"  
# [7,] "ES"    "C"  "1960" 
# [8,] "FUR"   "P"  "8"

It's not exactly clear where/how you want to replace those values in your original data though.

Sample data:

mydf <- data.frame(v1 = c("00700/Z14P120:xhkg", "03988/Z14C3.2:xhkg",
  "6A/F15C0.905:xcme", "ADS/X14P56:xeur", "AX1/X14P375:xams",
  "BIDU/28X14C250:xcbf", "ES/F15C1960:xcme", "FUR/M16P8:xams"))

Thank you! I want the 2nd column into the PutCall column, and the 3rd into the Strike column. Which I should be able to do myself now that I have the separated values. — Marcus, Nov 10 '14 at 15:02

R Splitting strings with pattern of irregular lengths

1 Answers1