I am creating a dataset to compute aggregate values for different combinations of words using regex. Each row has a unique regex pattern, which I want to check against another dataset to count how many of its entries match.

The first dataset (df1) looks like this:

   word1    word2               pattern
   air      10     (^|\\s)air(\\s.*)?\\s10($|\\s)
 airport    20   (^|\\s)airport(\\s.*)?\\s20($|\\s)
   car      30     (^|\\s)car(\\s.*)?\\s30($|\\s)
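
For reference, a data frame like this could be built as follows. This is only a minimal sketch: it assumes the pattern column is derived from word1 and word2 with paste0(), which the question does not state.

```r
# Rebuild df1 from the table above (values copied from the question).
df1 <- data.frame(
  word1 = c("air", "airport", "car"),
  word2 = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Assumption: each pattern is word1, optional intervening words, then word2,
# with both ends anchored on whitespace or the start/end of the string.
df1$pattern <- paste0("(^|\\s)", df1$word1, "(\\s.*)?\\s", df1$word2, "($|\\s)")

print(df1$pattern)
```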

The other dataset (df2), which I want to match against, looks like this:

   sl_no    query
   1      air 10     
   2    airport 20   
   3    airport 20
   3    airport 20
   3      car 30

The final output I want should look like this:

   word1    word2    total_occ
   air      10       1
 airport    20       3
   car      30       1

I am able to do this by using apply in R

# Count how many queries in df2 match the pattern of one row of df1
process <- function(x) {
  length(grep(x[["pattern"]], df2$query))
}

df1$total_occ <- apply(df1, 1, process)

but this takes a long time since my dataset is pretty big.

I found out that the mclapply function from the parallel package can be used to run such things on multiple cores, so I am first trying to get it working with lapply. It gives me the following error:

lapply(df,process)

Error in x[, "pattern"] : incorrect number of dimensions

Please let me know what changes I should make to run lapply correctly.

zx8754
HoneyBadger
  • You're iterating over the patterns, so that should be your first arg to `lapply`, right? – Frank Jun 17 '15 at 16:03
  • Here's why you get that error: `lapply` applies a function to each element of a list in turn, so the function has to be able to operate on the elements of the list. The elements of a dataframe are its columns, so you are asking R to apply `process` to each column of `df`. – tegancp Jun 17 '15 at 16:14
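
The point made in the comments above can be sketched as follows. This is a minimal illustration using the data from the question; `process_row` is a hypothetical helper name, and iterating over row indices is just one possible way to keep a row-wise function.

```r
# Data copied from the question.
df1 <- data.frame(
  word1 = c("air", "airport", "car"),
  word2 = c(10, 20, 30),
  pattern = c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
              "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
              "(^|\\s)car(\\s.*)?\\s30($|\\s)"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  sl_no = c(1, 2, 3, 3, 3),
  query = c("air 10", "airport 20", "airport 20", "airport 20", "car 30"),
  stringsAsFactors = FALSE
)

# lapply() on a data frame visits one column at a time, not one row.
cols_seen <- lapply(df1, class)
print(names(cols_seen))

# To work row by row, iterate over the row indices instead.
# process_row is a hypothetical helper, not part of the original code.
process_row <- function(i) length(grep(df1$pattern[i], df2$query))
counts <- lapply(seq_len(nrow(df1)), process_row)
print(unlist(counts))
```

Each column handed to the function is a plain vector, which has no "pattern" column to index; that is why a row-wise function fails there.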

1 Answer

Why not just lapply() over the pattern?

Here I've just pulled out your patterns into a vector, but this could just as easily be df1$pattern:

pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
             "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
             "(^|\\s)car(\\s.*)?\\s30($|\\s)")

Using your data for df2

txt <- "sl_no    query
   1      'air 10'     
   2    'airport 20'   
   3    'airport 20'
   3    'airport 20'
   3      'car 30'"
df2 <- read.table(text = txt, header = TRUE)

Just iterate on pattern directly

> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1

[[2]]
[1] 2 3 4

[[3]]
[1] 5

If you want more compact output, as suggested in your question, run lengths() over the list that is returned (thanks to @Frank for pointing out the new function lengths()). For example:

lengths(lapply(pattern, grep, x = df2$query))

which gives

> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1

You can add this to the original data via

dfnew <- cbind(df1[, 1:2],
               Count = lengths(lapply(pattern, grep, x = df2$query)))
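
Since the question mentions mclapply, the same iteration could be sketched in parallel as below. This is a rough sketch under some assumptions: `n_cores` is a hypothetical setting, forking is not available on Windows (so the sketch falls back to one core there), and whether it is actually faster depends on the size of the data.

```r
library(parallel)

# Patterns and queries copied from the question.
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
             "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
             "(^|\\s)car(\\s.*)?\\s30($|\\s)")
query <- c("air 10", "airport 20", "airport 20", "airport 20", "car 30")

# Assumption: two worker processes; mclapply() only supports mc.cores = 1
# on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same call shape as lapply(pattern, grep, x = ...), spread over cores.
hits <- mclapply(pattern, grep, x = query, mc.cores = n_cores)
total_occ <- lengths(hits)  # lengths() needs R >= 3.2.0
print(total_occ)
```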
Gavin Simpson
  • And then `lengths` on that if they have the latest version of R – Frank Jun 17 '15 at 16:09
  • @Frank Yup; I just noticed that part of the Q as it wasn't in any markup. Added that now. – Gavin Simpson Jun 17 '15 at 16:12
  • Hm, I see you've added a version with `length`, but you can keep your original way and just wrap it in the new function `lengths`, like `lengths(lapply(...etc...))` – Frank Jun 17 '15 at 16:14
  • @Frank +1 I didn't know about `lengths()`! Thanks for the heads-up on that new function. Will update the answer. (When did `lengths()` get into R?) – Gavin Simpson Jun 17 '15 at 16:16
  • Just in 3.2.0. I saw someone show that it's a lot faster than `sapply(x,length)`, but mostly it's just really convenient. – Frank Jun 17 '15 at 16:17
  • Yep, convenient indeed. Need to decide whether to start using that in some of my packages. Would clean up some code but at expense of dependency on R 3.2.0... – Gavin Simpson Jun 17 '15 at 16:20
  • @Gavin - Thank you for the detailed solution, this is exactly what I wanted to do. This just saved me a whole lot of digging :) – HoneyBadger Jun 17 '15 at 16:27
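
For R versions before 3.2.0, where `lengths()` is not available (as discussed in the comments above), the counts could be computed with `vapply()` instead. This is only a sketch of one alternative, using the data from the question.

```r
# Patterns and queries copied from the question.
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
             "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
             "(^|\\s)car(\\s.*)?\\s30($|\\s)")
query <- c("air 10", "airport 20", "airport 20", "airport 20", "car 30")

# vapply() returns an integer vector directly, so no lengths() call is needed.
total_occ <- vapply(pattern,
                    function(p) length(grep(p, query)),
                    integer(1),
                    USE.NAMES = FALSE)
print(total_occ)
```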