I want to manipulate substrings in one column based on the indices of these substrings stored in another column of a dataframe:

Data:

df_test
                               Turn                              c5                              Turns_split
1 we 're not gon na know the person PNP VBB XX0 VVG TO0 VVI AT0 NN1 we, 're, not, gon, na, know, the, person
2                      great answer                         AJ0 NN1                            great, answer
3                 it 's gon na rain             PNP VBZ VVG TO0 VVI                    it, 's, gon, na, rain
                                c5_split Index
1 PNP, VBB, XX0, VVG, TO0, VVI, AT0, NN1     4
2                               AJ0, NN1      
3                PNP, VBZ, VVG, TO0, VVI     3

The indices (the values 4 and 3) are stored in column Index; the substrings I want to manipulate are stored in c5, which contains Part-of-Speech tags. The manipulation I would like to do is focused on two substrings in c5: (i) the substring whose index is the same as the index value in Index and (ii) the substring right thereafter, i.e., the substring with the Index value + 1. The manipulation I want to carry out is to replace the whitespace between the two substrings with an = sign. So the desired output in column c5 is this:

df_test$c5
"PNP VBB XX0 VVG=TO0 VVI AT0 NN1" "AJ0 NN1"                         "PNP VBZ VVG=TO0 VVI"

I'm really at a loss for how to do this and would therefore be grateful for guidance.

Reproducible data:

df_test <- structure(list(Turn = c("we 're not gon na know the person", 
"great answer", "it 's gon na rain"), c5 = c("PNP VBB XX0 VVG TO0 VVI AT0 NN1", 
"AJ0 NN1", "PNP VBZ VVG TO0 VVI"), Turns_split = list(c("we", 
"'re", "not", "gon", "na", "know", "the", "person"), c("great", 
"answer"), c("it", "'s", "gon", "na", "rain")), c5_split = list(
    c("PNP", "VBB", "XX0", "VVG", "TO0", "VVI", "AT0", "NN1"), 
    c("AJ0", "NN1"), c("PNP", "VBZ", "VVG", "TO0", "VVI")), Index = list(
    4L, integer(0), 3L)), row.names = c(NA, -3L), class = "data.frame")
Wiktor Stribiżew
Chris Ruehlemann
  • Are the alphanumeric groupings in `c5` always in groups of 3? Why was the first value changed? Its index number is `1`, but the value of `Index` is `4` (i) and it's not following a row that does match (ii)? – LMc Jan 14 '21 at 17:26
  • Most of the time the substrings in `c5` are just 3 chars, sometimes though they can have this structure: XXX-XXX, i.e. three uppercase chars followed by a hyphen, again 3 uppercase chars. – Chris Ruehlemann Jan 14 '21 at 21:36

1 Answer

Try this:

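for (i in seq_len(nrow(df_test))) {
  idx <- df_test$Index[[i]]
  # Skip rows that have no index (integer(0))
  if (length(idx) == 0) next
  # Split the tag string into a character vector of individual tags
  s <- unlist(strsplit(df_test$c5[i], split = " "))
  # Join the tag at idx with the following tag using "="
  s[idx] <- paste0(s[idx], "=", s[idx + 1])
  # Drop the now-redundant tag at idx + 1 and rebuild the string
  df_test$c5[i] <- paste(s[-(idx + 1)], collapse = " ")
}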
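
The same logic can also be written without an explicit loop. The following is only a sketch and not part of the tested answer above: the helper name `join_at_index` is hypothetical, and the guard for an index that points at the last tag is an added assumption (such rows are simply left unchanged). The two vectors repeat the `c5` and `Index` values from the question so the snippet runs on its own.

```r
# Hypothetical helper: join the tag at position idx with the next tag via "="
join_at_index <- function(tags, idx) {
  s <- strsplit(tags, " ", fixed = TRUE)[[1]]
  # Assumption: leave the string unchanged if there is no index
  # or if there is no tag after position idx
  if (length(idx) == 0 || idx >= length(s)) return(tags)
  s[idx] <- paste0(s[idx], "=", s[idx + 1])
  paste(s[-(idx + 1)], collapse = " ")
}

# Values copied from the question's reproducible data
c5 <- c("PNP VBB XX0 VVG TO0 VVI AT0 NN1", "AJ0 NN1", "PNP VBZ VVG TO0 VVI")
Index <- list(4L, integer(0), 3L)

# Apply the helper to each (string, index) pair; USE.NAMES = FALSE
# keeps the result an unnamed character vector
result <- mapply(join_at_index, c5, Index, USE.NAMES = FALSE)
print(result)
```

On the data frame itself this would presumably be used as `df_test$c5 <- mapply(join_at_index, df_test$c5, df_test$Index, USE.NAMES = FALSE)`.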
Almog5690
  • Sorry, I don't get the correct result (also, please change `index` to `Index`). – Chris Ruehlemann Jan 14 '21 at 21:35
  • Hey, I edited my answer, hopefully now it works. – Almog5690 Jan 15 '21 at 06:58
  • No, it still does not produce the desired changes. – Chris Ruehlemann Jan 15 '21 at 08:08
  • Hey, sorry for the late response. I fixed the code and tested it myself, and it works for me. – Almog5690 Jan 15 '21 at 20:12
  • It works! Wonderful! Would you perhaps explain to me the details of the `for` loop? – Chris Ruehlemann Jan 15 '21 at 20:25
  • The `strsplit` line splits the text in `c5` into a character vector containing the individual tags. The `paste0` line writes the desired string (the two tags joined by an equals sign) at the position given by `Index`. The final `paste` line pastes the character vector back together after removing the now-redundant element at position `Index + 1`. I hope this explanation is understandable... – Almog5690 Jan 15 '21 at 21:04
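
To make the explanation above concrete, here is a small trace of the loop body for a single row, using the third row of the question's data (`c5 = "PNP VBZ VVG TO0 VVI"`, `Index = 3`). The variable names `idx` and `out` are introduced here for illustration only.

```r
# Trace of the loop body for one row of the question's data
idx <- 3L
s <- unlist(strsplit("PNP VBZ VVG TO0 VVI", split = " "))
# s is now c("PNP", "VBZ", "VVG", "TO0", "VVI")

s[idx] <- paste0(s[idx], "=", s[idx + 1])
# s is now c("PNP", "VBZ", "VVG=TO0", "TO0", "VVI")

# Negative indexing drops the leftover "TO0" at position idx + 1
out <- paste(s[-(idx + 1)], collapse = " ")
print(out)
# [1] "PNP VBZ VVG=TO0 VVI"
```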