I have been struggling with this tricky problem and searching for an optimal solution. Basically, I need to find phrases made up of the same or a similar combination of words, and keep only the one with the highest value in a second column. So far I have tried expand.grid() and agrep but had no success.

The other option, as a last resort, is to go through every term, split it into words on spaces, and try to match all possible combinations against all other terms using a few for loops. But the computational cost would be too high, since my actual data is considerably larger.

Below is the sample data:

sample <- data.frame(
  Terms = I(c("clamp", "rod", "rod44", "rod21", "rod21", "rod13", "rod21", "rod12",
              "rod iron plate", "metal plate", "plate metal", "plates",
              "rods", "plate rod iron", "11mm rod", "25mm rod", "40mm plate", "rod 11mm")),
  Weights = I(c(10, 10, 10, 10, 10, 10, 10, 10,
                50, 45, 60, 20, 30, 100, 30, 20, 40, 50))
)

DESIRED OUTPUT:

         Terms Weights
      rod 11mm      50
      25mm rod      20
    40mm plate      40
         clamp      10
plate rod iron     100
   plate metal      60
           ...
Dev Patel
  • Not an answer, but what you're doing is related to n-grams; an R search on that term may be useful. Perhaps think about stemming as well. But honestly, splitting on spaces (with a proper vectorized approach) is likely much faster than anything using `agrep`, which uses a complex algorithm. – Tyler Rinker Feb 14 '14 at 16:55
  • If you do go the splitting route, look at @AnandaMahto's potential solution in this **[SO Question](http://stackoverflow.com/questions/21733345/best-way-to-manipulate-strings-in-big-data-table/21736272#21736272)** – BrodieG Feb 14 '14 at 16:58
  • Thanks @BrodieG. I will take a look at it if I don't find a simpler approach. I just came across a new function, **permn()** from the **combinat** package, that at least gives all possible permutations. I am going to try it and see :). – Dev Patel Feb 14 '14 at 19:11
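
The suggestions in these comments (split on spaces, stay vectorized, consider stemming) can be sketched roughly as follows. This is only an illustration, not code from the thread: `term_key` is a hypothetical helper, words are assumed to be separated by single spaces, and stripping a trailing "s" is a deliberately crude stand-in for a real stemmer.

```r
# Hypothetical helper: build an order-insensitive key for each term.
# Assumption: words are separated by single spaces, and removing a
# trailing "s" is an acceptable (very crude) substitute for stemming.
term_key <- function(terms, stem = FALSE) {
  words <- strsplit(as.character(terms), " ", fixed = TRUE)
  vapply(words, function(w) {
    if (stem) w <- sub("s$", "", w)   # crude plural stripping only
    paste(sort(w), collapse = " ")    # sorted words ignore word order
  }, character(1))
}

terms <- c("metal plate", "plate metal", "rods", "rod", "plate rod iron")
term_key(terms)               # "metal plate" and "plate metal" share a key
term_key(terms, stem = TRUE)  # "rods" and "rod" now share a key as well
```

Sorting the words avoids generating permutations at all: every ordering of the same words collapses to one key. The crude stemming would mangle words such as "glass", so a proper stemmer is likely needed for real data.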

1 Answer

Here's an approach that works with your data but the problem may be more complex:

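library(splitstackshape)
# Split each term on spaces into one column per word.
# Note: read.concat is an unexported (internal) helper, hence `:::`.
dat2 <- splitstackshape:::read.concat(sample[, 1], "Term", " ")

# Treat empty cells as NA; sort() drops NAs, so they stay out of the key.
dat2[dat2 == ""] <- NA
# Order-insensitive key: each row's words, sorted and pasted together.
key <- apply(dat2, 1, function(x) paste(sort(x), collapse = " "))

# Group rows by key and keep the highest-weight row in each group.
dat3 <- split(sample, key)
do.call(rbind, lapply(dat3, function(x) {
    x[which.max(x$Weights), ]
}))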

##                         Terms Weights
## 11mm rod             rod 11mm      50
## 25mm rod             25mm rod      20
## 40mm plate         40mm plate      40
## clamp                   clamp      10
## iron plate rod plate rod iron     100
## metal plate       plate metal      60
## plates                 plates      20
## rod                       rod      10
## rod12                   rod12      10
## rod13                   rod13      10
## rod21                   rod21      10
## rod44                   rod44      10
## rods                     rods      30
Tyler Rinker
  • Thanks, Tyler. I tried your logic on a somewhat bigger piece of the data and it doesn't give the correct results. I think it's the **split** function that doesn't split correctly. – Dev Patel Feb 14 '14 at 19:09
  • Without isolating what the problem is we can't be of much help. Isolate the problem and then add that corner case to the data set you supply above. – Tyler Rinker Feb 14 '14 at 19:37
  • I have updated the data and tried your suggested code, and it seems it's not being read correctly by **read.table**. I am going to try reading it differently, maybe using regex, and see if it works. – Dev Patel Feb 14 '14 at 21:24
  • I got annoyed with `read.table` and went with the `library(splitstackshape)` package. I think it can be done with `read.table` but haven't invested the time to figure out how. – Tyler Rinker Feb 14 '14 at 21:45
  • Ha, thanks Tyler. I also figured out another way to read it using the code below, but yours definitely looks much nicer. I didn't know about the **splitstackshape** library. `url_parts_search <- lapply(sample$Terms, strsplit, ' ', perl=TRUE); dat2 <- data.frame(do.call(rbind.fill.matrix, lapply(url_parts_search, function(x) do.call (rbind, x))))` – Dev Patel Feb 14 '14 at 21:50
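
The reading and splitting trouble discussed in the comments above can probably be sidestepped with base R's `strsplit()`, which returns a list and so needs no padding for terms of different lengths. What follows is a minimal sketch under assumptions, not the answer's code: `sample_df`, `key`, and `best` are illustrative names, the data is a subset of the question's sample, and ties in weight go to whichever row comes first.

```r
# Subset of the question's sample data (illustrative names, plain columns).
sample_df <- data.frame(
  Terms   = c("rod iron plate", "metal plate", "plate metal",
              "plate rod iron", "11mm rod", "rod 11mm", "clamp"),
  Weights = c(50, 45, 60, 100, 30, 50, 10),
  stringsAsFactors = FALSE
)

# Split each term on spaces; the result is a list of word vectors.
words <- strsplit(sample_df$Terms, " ", fixed = TRUE)

# Order-insensitive key: the term's words, sorted and re-joined.
sample_df$key <- vapply(words, function(w) paste(sort(w), collapse = " "),
                        character(1))

# Sort by descending weight, then keep the first row seen for each key.
ordered <- sample_df[order(-sample_df$Weights), ]
best <- ordered[!duplicated(ordered$key), c("Terms", "Weights")]
best
```

Sorting once and dropping duplicated keys avoids the per-group `split()` and `rbind` step, which may matter on larger data. Unlike the answer's output, the rows come back in descending weight order rather than sorted by key.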