
Any assistance with my problem would be much appreciated, thanks.

I have a data frame in which the second column holds 'selected' words that were extracted from the first column in previous steps. The extraction has often (but not always) left those words in a different running order. I now need the words in the column 'wordsDF$subbed' to appear in the same running order as they are found in the column 'wordsDF$original'.

I have posted a small subset to illustrate, with a fourth column (wordsDF$target) that I have completed by hand to demonstrate my goal.

I am attempting to create the third column (wordsDF$reord), which would hold the words of 'wordsDF$subbed' in the order they are found in 'wordsDF$original', using sapply(). I am stuck, though, on how to apply the function across all the words in each string of wordsDF$original, since the strings vary in length (i.e. number of words). The only way I can think of achieving this is to use the stringr function str_detect to check, from left to right, whether each word in wordsDF$original is in wordsDF$subbed and, if 'yes', to extract that word into wordsDF$reord (pasted onto anything already extracted). If 'no', wordsDF$reord stays unchanged.

My attempt is below; however, it is hard-coded to inspect and extract only the first word. Can anyone please show me how to apply the function to every word of each string? Or is there a better approach that reorders wordsDF$subbed directly and removes the need for wordsDF$reord?

library(stringr)

original = c("heat pump only for 100/150l geyser r410a gas", 
         "alliance allwh 5_dcpt_0kw heat pump only for 200/25",
         "alliance allwinteg 190l integrated heat pump and cylinder r134a gas",
         "aquatouch bt10 cp bottle trap 32x40",
         "aquatouch pop32lux cp slotted pop up basin waste 32mm",
         "aquatouch ci15 cp angle regulating valve only 15x15")

subbed = c("heat pump",
       "heat pump",
       "and cylinder  heat pump",
       "bottle trap",
       "basin  pop up waste",
       "valve")


wordsDF = as.data.frame(cbind(original, subbed))
wordsDF$original = as.character(wordsDF$original)
wordsDF$subbed = as.character(wordsDF$subbed)
wordsDF$reord = character(nrow(wordsDF))
wordsDF$target = c("heat pump","heat pump",
               "heat pump and cylinder",
               "bottle trap","pop up basin waste",
               "valve")

# my attempted solution...
wordsDF$reord = sapply(wordsDF$original, function(x) ifelse(
            test = str_detect(wordsDF$subbed, word(wordsDF$original, 1,1)), 
            yes = paste(wordsDF$reord, str_extract(wordsDF$subbed, word(wordsDF$original, 1,1))),
            no = wordsDF$reord))

Thanks in advance!

CallumH

1 Answer


Here's a possible base R solution. It runs mapply over both split vectors, looks up the words of subbed within original, and pastes the matched words back together in the order in which they appear in original.

Rematch <- function(x, y) paste(y[sort(match(x, y))], collapse = " ") # Define a helper function
mapply(Rematch, strsplit(subbed, "\\s+"), strsplit(original, "\\s+"))
# [1] "heat pump"              "heat pump"              "heat pump and cylinder" "bottle trap"            "pop up basin waste"    
# [6] "valve"   
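
To see why this works, here is a minimal step-by-step sketch of what the helper does for the third row of the example data only. The variable name `pos` is introduced here purely for illustration.

```r
# Third row of the example data, split into words on runs of whitespace
x <- strsplit("and cylinder  heat pump", "\\s+")[[1]]
y <- strsplit("alliance allwinteg 190l integrated heat pump and cylinder r134a gas",
              "\\s+")[[1]]

# Position of each subbed word within the original string
pos <- match(x, y)
pos
# [1] 7 8 5 6

# Ascending positions correspond to the running order of the original
sort(pos)
# [1] 5 6 7 8

# Index the original words by the sorted positions and glue them back together
paste(y[sort(pos)], collapse = " ")
# [1] "heat pump and cylinder"
```

Note that `sort` drops `NA` values by default, so any word of subbed that does not occur in original would simply be left out of the result.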
akrun
David Arenburg
  • That has hit the nail firmly on the head. Many, many thanks @David. – CallumH Jan 24 '16 at 12:10
  • You could alternatively define `Rematch <- function(x, y) paste(x[match(y, x, nomatch = 0)], collapse = " ")`. Also, if the words are always separated by a single space, you could improve performance by specifying `split = " "` and `fixed = TRUE` in both `strsplit` calls (instead of `"\\s+"`) – David Arenburg Jan 24 '16 at 12:15
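
A rough sketch of the alternative from the comment above, run on two rows of the example data and splitting on a literal single space with `fixed = TRUE`. The names `Rematch2`, `original2`, `subbed2` and `reord2` are made up here for illustration. These two rows were picked because their subbed strings contain a double space: a fixed split then produces an empty string, which in this example is harmless because no word of the original is empty, so it is never matched.

```r
# Two rows taken from the example data in the question
original2 <- c("alliance allwinteg 190l integrated heat pump and cylinder r134a gas",
               "aquatouch pop32lux cp slotted pop up basin waste 32mm")
subbed2   <- c("and cylinder  heat pump",
               "basin  pop up waste")

# Alternative helper: walk along the original words (y) and pick out the
# subbed words (x) they match; nomatch = 0 makes unmatched words select nothing
Rematch2 <- function(x, y) paste(x[match(y, x, nomatch = 0)], collapse = " ")

# Split on a literal single space rather than the regular expression "\\s+"
reord2 <- mapply(Rematch2,
                 strsplit(subbed2, " ", fixed = TRUE),
                 strsplit(original2, " ", fixed = TRUE))
reord2
# [1] "heat pump and cylinder" "pop up basin waste"
```

The result can be assigned straight to a data frame column (for example `wordsDF$reord`) as long as it is computed from that data frame's own columns, since `mapply` returns one string per row.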