0

I need to rephrase my question, because i didn´t include other kinds of data in my data frame which cause a lot of problems when splitting by blanks. I´m really sorry!

Important: blanks cannot be trusted all over the dataset, because they appear in an erratic manner, even within the same type of date (type1a, type1b in my example)

df <- data.table(v=c( "       555 OUT XYZ STR44W PASSED TRUE",   #interesting data type1
                      "       A 45 OUT XYW STR44W PASSED TRUE",
                      "       555 OUT XYZ STR55W PASSED TRUE",
                      "       6755 OUT XYZ 4444W PASSED TRUE",
                      "  75/850CC/PF                         ",   #eratic data to be ignored
                      " BY HHU 56TT00 6 415 UP HHU 88H900   ",   #interesting data type2
                      "       555 OUT WWWZ STR44W PASSED TRUE")) 

desired outcome:

T1  T2  T1_V1   T1_V2   T1_V3   T2_V1   T2_V2   T2_V3   T2_V4 T2_V5
1   0   555     XYZ     STR44W  NA      NA      NA      NA    NA
1   0   A 45    XYW     STR44W  NA      NA      NA      NA    NA
1   0   555     XYZ     STR55W  NA      NA      NA      NA    NA
1   0   6755    XYZ     4444W   NA      NA      NA      NA    NA
0   0   NA      NA      NA      NA      NA      NA      NA    NA
0   1   NA      NA      NA      HHU     56TT00  6 415   HHU   88H900
1   0   555     NA      STR44W  NA      NA      NA      NA    NA

Solution by now for type1 data: library(data.table)

df <- data.table(v=c( " 555 OUT XYZ STR44W PASSED TRUE", #Type1a " A 45 OUT XYW STR44W PASSED TRUE", #Type1b " 555 OUT XYZ STR55W PASSED TRUE", #Type1a " 6755 OUT XYZ 4444W PASSED TRUE", #Type1a " 75/850CC/PF ", #!!new line of eratic data " BY HHU 56TT00 6 415 UP HHU 88H900 ", #Type2 " 555 OUT WWWZ STR44W PASSED TRUE" )) #Type1a

df$T1<-0
df$T1[grepl("PASSED TRUE", df$v)]<-1
df$T1_V1[df$T1==1]<-df$T1_V1
df$T1_V1[df$T1==1] <-gsub("\\OUT.*","",df$v) #Getting rid of the everything after "OUT"
df$T1_V2[df$T1==1]<-gsub(".*\\OUT","",df$v)#Getting rid of the everything before "OUT"
df$T1_V2 <-gsub("\\PASSED.*","",df$T1_V2) #Getting rid of the everything after "PASSED"
df$T1_V2<-strsplit(df$T1_V2, "[[:blank:]*]") # Seperation of the two relevant strings by stringsplit
df$T1_V2<- lapply(df$T1_V2, head)

Problem: stringplit allows for selecting the right block after single out eratic datastructures with more blanks

-> gsub the string down is extra effort, but with some of the data it was the only way to ensure that erratic blanks don´t mess up the variables

-> how to convert df$T1_V2 back to normal character variable with no blanks in it?

Old question:

First post, i tried my very best to find answers and to prepare my question.

I need to clean up a nasty string with a lot of spaces an irregularities. I try to get the first block before "OUT" and the second and third block between "OUT" and "PASSED". Afterwards the data should be checked with a list to control if v4 is correct.

Using stringsplit and afterwars head/tail doesn´t work and i would appreciate any help with that a lot! Many thanks in advance

library(data.table)

df <- data.table(v=c("       555 OUT XYZ STR44W PASSED TRUE",
                                  "       A 45 OUT XYW STR44W PASSED TRUE",
                                  "       555 OUT XYZ STR55W PASSED TRUE",
                                  "       6755 OUT XYZ 4444W PASSED TRUE",
                                  "       555 OUT WWWZ STR44W PASSED TRUE"))
control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))

df$v1 <-gsub("\\OUT.*","",df$v) #Getting rid of the everything after "OUT"
df$v2<-gsub(".*\\OUT","",df$v) #Getting rid of the everything before "OUT"
df$v2 <-gsub("\\PASSED.*","",df$v2) #Getting rid of the everything after "PASSED"
df$v2<-strsplit(df$v2, "[[:blank:]*]") # Seperation of the two relevant strings by stringsplit
df$v3<- lapply(df$v2, head) #Taking the first element from the stringsplit 
df$v4<- lapply(df$v2, head,2) #Taking the second element from the stringsplit 

After running that, in r-studio i get c("", "XYZ") for v4. The first element appears to be an empty element? I wasn´t able to go on with that expression by controlling directly from my control list (fail1) nor by converting (fail2) nor by unlist (fail3)

#fail#1
df$v4[!(df$v4 %in% control$control)] <- NA

#fail#2
df$v4 <- as.character(df$v4)

#fail3
df$v4 <- unlist(df$v4)
rolling r
  • 1
  • 1

2 Answers2

1

This works for your current data to get it into a tidier form.

library(data.table)

df <- data.table(v=c("       555 OUT XYZ STR44W PASSED TRUE",
                                  "       A 45 OUT XYW STR44W PASSED TRUE",
                                  "       555 OUT XYZ STR55W PASSED TRUE",
                                  "       6755 OUT XYZ 4444W PASSED TRUE",
                                  "       555 OUT WWWZ STR44W PASSED TRUE"))
control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))

df$v1 <-gsub("\\OUT.*","",df$v) #Getting rid of the everything after "OUT"
df$v2<-gsub(".*\\OUT","",df$v) #Getting rid of the everything before "OUT"

trim the whitespace and split it out by spaces, then cbind it to the current df. we can then rename the columns so it's a bit easier to navigate.

lists <- strsplit(trimws(df$v2), " ")
extra <- data.frame(do.call(rbind, lists))
newdf <- cbind(df, extra)
colnames(newdf) <- c("full string", paste0("piece_", 1:6))

newdf
                              full string      piece_1                  piece_2 piece_3 piece_4 piece_5 piece_6
1:         555 OUT XYZ STR44W PASSED TRUE         555    XYZ STR44W PASSED TRUE     XYZ  STR44W  PASSED    TRUE
2:        A 45 OUT XYW STR44W PASSED TRUE        A 45    XYW STR44W PASSED TRUE     XYW  STR44W  PASSED    TRUE
3:         555 OUT XYZ STR55W PASSED TRUE         555    XYZ STR55W PASSED TRUE     XYZ  STR55W  PASSED    TRUE
4:         6755 OUT XYZ 4444W PASSED TRUE        6755     XYZ 4444W PASSED TRUE     XYZ   4444W  PASSED    TRUE
5:        555 OUT WWWZ STR44W PASSED TRUE         555   WWWZ STR44W PASSED TRUE    WWWZ  STR44W  PASSED    TRUE
Matt W.
  • 3,692
  • 2
  • 23
  • 46
1

I did not completely understand what your end result requirement is. You do not need to use gsub at every step. You can split everything by space, and select the columns you would need to work further.

library(tidyr)
library(splitstackshape) # cSplit function    

df_selected <- df %>% cSplit("v", " ") %>% select(v_1,v_3,v_4,v_6)

control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))
filter(df_selected, v_3 %in% control$control)
user5249203
  • 4,436
  • 1
  • 19
  • 45