4

I have the following example:

dat <- read.table(text="index  string
1      'I have first and second'
2      'I have first, first'
3      'I have second and first and thirdeen'", header=TRUE)


toMatch <-  c('first', 'second', 'third')

library(stringi)  # provides stri_count_regex

dat$count <- stri_count_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|"))

dat

  index                               string count
1     1              I have first and second     2
2     2                  I have first, first     2
3     3 I have second and first and thirdeen     2

I want to add a column count to the data frame that tells me how many UNIQUE words from toMatch each row contains. In this case, the desired output would be:

  index                               string count
1     1              I have first and second     2
2     2                  I have first, first     1
3     3 I have second and first and thirdeen     2

Could you please give me a hint on how to modify the original formula? Thank you very much.

Florian
LMach

2 Answers

2

With base R you could do the following:

sapply(dat$string, function(x) {
    # count how many of the words in toMatch occur in x as whole words
    sum(sapply(toMatch, function(y) grepl(paste0('\\b', y, '\\b'), x)))
}, USE.NAMES = FALSE)

which returns

[1] 2 1 2
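To get the result into the data frame as the question asks, the same idea can be written with vapply and assigned to dat$count. This is only a minimal sketch: it rebuilds dat with data.frame() instead of read.table() so that it is self-contained, and the helper name patterns is made up for illustration.

```r
# Rebuild the example data from the question
dat <- data.frame(
  index  = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# One whole-word pattern per keyword ('patterns' is an illustrative name)
patterns <- paste0("\\b", toMatch, "\\b")

# For each row, count how many of the patterns occur at least once
dat$count <- vapply(
  dat$string,
  function(s) sum(vapply(patterns, grepl, logical(1), x = s)),
  integer(1),
  USE.NAMES = FALSE
)

dat$count
```

Because each keyword contributes at most one TRUE per row, repeated words such as "first, first" are only counted once.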

Hope this helps!

Florian
  • Only one loop is required: `sapply(stri_extract_all_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|")), function(x) length(unique(x)))` – talat Apr 24 '18 at 08:10
  • That's also a nice solution, and a bit closer to OP's own attempt. I think you could/should add that as an answer :) – Florian Apr 24 '18 at 08:14
1

We can use stri_match_all instead, which returns the actual matches, and then count the distinct values with n_distinct from dplyr or with length(unique(x)) in base R.

library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
                    collapse="|")), n_distinct)

#[1] 2 1 2

Or similarly, with base R's length(unique(x)) in place of n_distinct:

sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
         collapse="|")), function(x) length(unique(x)))

#[1] 2 1 2
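One thing to watch with the extract-then-deduplicate approach: by default the stringi functions return NA for a string with no match, and length(unique(NA)) is 1, so a row without any keyword would be counted as 1. The sketch below shows the same idea with base R's gregexpr and regmatches, which return an empty vector for no match. The fourth row is a hypothetical addition, not part of the original data, included only to show that case.

```r
# Example data from the question plus one hypothetical row with no keyword
dat <- data.frame(
  index  = 1:4,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen",
             "nothing to see here"),   # hypothetical extra row
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# Single alternation pattern, as in the question
pattern <- paste0("\\b", toMatch, "\\b", collapse = "|")

# Extract every whole-word match per row; rows without a match give character(0)
hits <- regmatches(dat$string, gregexpr(pattern, dat$string, perl = TRUE))

# Number of distinct matched words per row
dat$count <- lengths(lapply(hits, unique))

dat$count
```

If you prefer to stay with stringi, stri_extract_all_regex has an omit_no_match argument that should give the same empty-vector behaviour when set to TRUE.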
Ronak Shah