I need to rephrase my question, because i didn´t include other kinds of data in my data frame which cause a lot of problems when splitting by blanks. I´m really sorry!
Important: blanks cannot be trusted all over the dataset, because they appear in an erratic manner, even within the same type of date (type1a, type1b in my example)
df <- data.table(v=c( " 555 OUT XYZ STR44W PASSED TRUE", #interesting data type1
" A 45 OUT XYW STR44W PASSED TRUE",
" 555 OUT XYZ STR55W PASSED TRUE",
" 6755 OUT XYZ 4444W PASSED TRUE",
" 75/850CC/PF ", #eratic data to be ignored
" BY HHU 56TT00 6 415 UP HHU 88H900 ", #interesting data type2
" 555 OUT WWWZ STR44W PASSED TRUE"))
desired outcome:
T1 T2 T1_V1 T1_V2 T1_V3 T2_V1 T2_V2 T2_V3 T2_V4 T2_V5
1 0 555 XYZ STR44W NA NA NA NA NA
1 0 A 45 XYW STR44W NA NA NA NA NA
1 0 555 XYZ STR55W NA NA NA NA NA
1 0 6755 XYZ 4444W NA NA NA NA NA
0 0 NA NA NA NA NA NA NA NA
0 1 NA NA NA HHU 56TT00 6 415 HHU 88H900
1 0 555 NA STR44W NA NA NA NA NA
Solution by now for type1 data: library(data.table)
df <- data.table(v=c( " 555 OUT XYZ STR44W PASSED TRUE", #Type1a " A 45 OUT XYW STR44W PASSED TRUE", #Type1b " 555 OUT XYZ STR55W PASSED TRUE", #Type1a " 6755 OUT XYZ 4444W PASSED TRUE", #Type1a " 75/850CC/PF ", #!!new line of eratic data " BY HHU 56TT00 6 415 UP HHU 88H900 ", #Type2 " 555 OUT WWWZ STR44W PASSED TRUE" )) #Type1a
df$T1<-0
df$T1[grepl("PASSED TRUE", df$v)]<-1
df$T1_V1[df$T1==1]<-df$T1_V1
df$T1_V1[df$T1==1] <-gsub("\\OUT.*","",df$v) #Getting rid of the everything after "OUT"
df$T1_V2[df$T1==1]<-gsub(".*\\OUT","",df$v)#Getting rid of the everything before "OUT"
df$T1_V2 <-gsub("\\PASSED.*","",df$T1_V2) #Getting rid of the everything after "PASSED"
df$T1_V2<-strsplit(df$T1_V2, "[[:blank:]*]") # Seperation of the two relevant strings by stringsplit
df$T1_V2<- lapply(df$T1_V2, head)
Problem: stringplit allows for selecting the right block after single out eratic datastructures with more blanks
-> gsub the string down is extra effort, but with some of the data it was the only way to ensure that erratic blanks don´t mess up the variables
-> how to convert df$T1_V2 back to normal character variable with no blanks in it?
Old question:
First post, i tried my very best to find answers and to prepare my question.
I need to clean up a nasty string with a lot of spaces an irregularities. I try to get the first block before "OUT" and the second and third block between "OUT" and "PASSED". Afterwards the data should be checked with a list to control if v4 is correct.
Using stringsplit and afterwars head/tail doesn´t work and i would appreciate any help with that a lot! Many thanks in advance
library(data.table)
df <- data.table(v=c(" 555 OUT XYZ STR44W PASSED TRUE",
" A 45 OUT XYW STR44W PASSED TRUE",
" 555 OUT XYZ STR55W PASSED TRUE",
" 6755 OUT XYZ 4444W PASSED TRUE",
" 555 OUT WWWZ STR44W PASSED TRUE"))
control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))
df$v1 <-gsub("\\OUT.*","",df$v) #Getting rid of the everything after "OUT"
df$v2<-gsub(".*\\OUT","",df$v) #Getting rid of the everything before "OUT"
df$v2 <-gsub("\\PASSED.*","",df$v2) #Getting rid of the everything after "PASSED"
df$v2<-strsplit(df$v2, "[[:blank:]*]") # Seperation of the two relevant strings by stringsplit
df$v3<- lapply(df$v2, head) #Taking the first element from the stringsplit
df$v4<- lapply(df$v2, head,2) #Taking the second element from the stringsplit
After running that, in r-studio i get c("", "XYZ") for v4. The first element appears to be an empty element? I wasn´t able to go on with that expression by controlling directly from my control list (fail1) nor by converting (fail2) nor by unlist (fail3)
#fail#1
df$v4[!(df$v4 %in% control$control)] <- NA
#fail#2
df$v4 <- as.character(df$v4)
#fail3
df$v4 <- unlist(df$v4)