Count occurrences of multiple strings in one character variable

Question

I have a dataset of tweets downloaded with rtweet, and I'd like to see how many times three different strings occur in the variable x$mentions_screen_name.

The key thing I'm trying to do is count how many times 'A' occurs, then 'B', then 'C'. My attempt at a reproducible example is as follows.

#These are the strings I would like to count
var<-c('A', 'B', 'C')
#The variable that contains the strings looks like this
library(stringi)
df<-data.frame(var1=stri_rand_strings(100, length=3, '[A-C]'))
#How do I count how many cases contain A, then B, and then C?
library(purrr)
df%>% 
  map(var, grepl(., df$var1))

Rich Scriven · Answer 1 · 2018-03-09T19:29:32.630


You can do this easily by summing the columns after running grepl() through sapply().

colSums(sapply(var, grepl, df$var1))
#  A  B  C 
# 72 72 69
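
To see why this works, here is a small deterministic sketch on hand-made strings. The vector `x` is made up for illustration and is not the asker's data, and `fixed = TRUE` is an added assumption that the patterns are literal strings rather than regular expressions. `sapply()` returns a logical matrix with one column per pattern, and `colSums()` counts the `TRUE` values in each column.

```r
# Hypothetical toy data standing in for df$var1
x <- c("ABA", "CCB", "AAA", "BCB")
var <- c("A", "B", "C")

# One row per string, one column per pattern; TRUE where the pattern is found
hits <- sapply(var, grepl, x, fixed = TRUE)
hits

# Number of strings that contain each pattern at least once
n_strings <- colSums(hits)
n_strings
# A B C
# 2 3 2
```

Note that this counts strings, not occurrences: "AAA" adds only 1 to the total for A.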

edited Mar 09 '18 at 19:29

answered Mar 09 '18 at 19:22

Rich Scriven


score 1 · Answer 2 · answered Mar 09 '18 at 19:36

If you want to count ALL occurrences (including multiple occurrences within a single string), you can use str_count from the stringr package.

library(purrr)
map_int(var, ~sum(stringr::str_count(df$var1, .)))
# [1]  90 112  98

Otherwise, you can use str_detect.

map_int(var, ~sum(stringr::str_detect(df$var1, .)))
# [1] 66 71 70
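
The difference between the two is easier to see on a fixed input. The following is a minimal sketch; the vector `x` is hypothetical toy data, and `set_names()` is used only so the results are labelled by pattern.

```r
library(purrr)
library(stringr)

# Hypothetical toy data standing in for df$var1
x <- c("ABA", "CCB", "AAA", "BCB")
var <- c("A", "B", "C")

# Total occurrences: "AAA" contributes 3 to the count for "A"
total <- map_int(set_names(var), ~ sum(str_count(x, .x)))
total
# A B C
# 5 4 3

# Number of strings containing the pattern: "AAA" contributes only 1
n_strings <- map_int(set_names(var), ~ sum(str_detect(x, .x)))
n_strings
# A B C
# 2 3 2
```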

score 1 · Answer 3 · answered Mar 10 '18 at 16:53

I think you may want something different from what others have posted. I may be wrong, but the phrase you used:

 'A' occurs, then 'B', then 'C'

indicates to me that you want to check whether things happen in a particular order.

If this is the case, may I suggest making your question more explicit. You provide a minimal working example, but it could be made more minimal without the need for stringi (which I love as a package), because I doubt your tweets look anything like "ACB" in reality. Hand-making 3-5 strings could accomplish this without loading another package. Showing your desired output also makes the problem more explicit, with less need for explanation.

# Load packages before they are used
library(stringi); library(dplyr)

# tibble() replaces the deprecated data_frame()
df <- tibble(var1=c(
    "I think A is good But then C.",
    "'A' occurs, then 'B', then 'C'",
    "and a then lower with b that c will fail",
    NA,
    "what about A, B, C and another ABC",
    "CBA?",
    "last null"
))

var <- c('A', 'B', 'C')

df%>% 
    mutate(
        count_abc = stringi::stri_count_regex(
            var1, 
            paste(var, collapse = '.*?')
        ),
        indicator = count_abc > 0
    )

##   var1                                     count_abc indicator
## 1 I think A is good But then C.                    1 TRUE     
## 2 'A' occurs, then 'B', then 'C'                   1 TRUE     
## 3 and a then lower with b that c will fail         0 FALSE    
## 4 <NA>                                            NA NA       
## 5 what about A, B, C and another ABC               2 TRUE     
## 6 CBA?                                             0 FALSE    
## 7 last null                                        0 FALSE   

## or if you only care about the summary compute it directly
df%>% 
    summarize(
        count_abc = sum(stringi::stri_detect_regex(
            var1, 
            paste(var, collapse = '.*?')
        ), na.rm = TRUE)
    )


##   count_abc
## 1         3
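
For readers without stringi, the same idea can be sketched in base R. This is a swapped-in equivalent under stated assumptions, not the code used above: the vector `x` is a hypothetical subset of the example strings, and `perl = TRUE` is chosen so that the lazy quantifier `.*?` is handled by the PCRE engine. The pattern built by `paste()` is `A.*?B.*?C`, which matches an A, later a B, and later a C, each time taking the shortest possible gap.

```r
var <- c("A", "B", "C")

# Join the strings with a lazy "anything in between" wildcard
pattern <- paste(var, collapse = ".*?")
pattern
# [1] "A.*?B.*?C"

# Hypothetical subset of the example strings
x <- c("what about A, B, C and another ABC",
       "CBA?",
       "I think A is good But then C.")

# Count non-overlapping ordered matches per string
count_abc <- lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
count_abc
# [1] 2 0 1

# Number of strings with at least one ordered match
sum(count_abc > 0)
# [1] 2
```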

If I'm wrong, my apologies for the misunderstanding.

score 0 · Accepted Answer · answered Mar 09 '18 at 19:43


Another option using stringr and sapply could be:

library(stringr)
library(stringi)  # needed for stri_rand_strings()
set.seed(1)
df<-data.frame(var1=stri_rand_strings(100, length=3, '[A-C]'))

var<-c('A', 'B', 'C')
colSums(sapply(var, function(x,y)str_count(y, x), df$var1 ))
#A   B   C 
#101 109  90

answered Mar 09 '18 at 19:43

MKR


This is the nicest answer because I like how it sticks with the stringr library. – spindoctor Mar 13 '18 at 14:28
