How to search for words with asterisks and wildcards (e.g., exampl*) in R (word appearance in a data frame)

Question

I wrote some code to count how often words appear in a data frame:

Items <- c('decid*', 'head', 'heads')
df1 <- data.frame(Items)
words <- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main <- data.frame(words)
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))) {
  item[i] <- Items[i]
  count[i] <- sum(df_main$words == item[i])
}
word_freq <- data.frame(cbind(item, count))
word_freq

However, the results are like this:

	item	count
1	decid*	0
2	head	1
3	heads	1

As you can see, it does not count "decid*" correctly. The results I expect are:

	item	count
1	decid*	2
2	head	1
3	heads	1

I think I need to change the format of the item word (decid*), but I could not figure out how. Any help is much appreciated!

Ronak Shah · Answer 1 · 2021-10-07T10:42:33.337

1

I think you want to use decid* as a regex pattern. == looks for an exact match; you can use grepl to look for a particular pattern.

I have used sapply as an alternative to the for loop.

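result <- stack(sapply(unique(df1$Items), function(x) {
  # items with a wildcard: glob2rx() turns 'decid*' into the anchored regex '^decid'
  # items without a wildcard: exact match
  if(grepl('*', x, fixed = TRUE)) sum(grepl(glob2rx(x), df_main$words))
  else sum(x == df_main$words)
}))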

result
# values    ind
#1      2 decid*
#2      1   head
#3      1  heads
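
A caveat on the wildcard: read as a regular expression, decid* means "deci" followed by zero or more "d", anywhere in the string, so it also matches undecided. A minimal sketch of one way around this, assuming the asterisk is meant as a shell-style wildcard: glob2rx() from base R's utils package translates such a wildcard into an anchored regular expression. The names pattern_raw and pattern_glob are illustrative only.

```r
words <- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')

# Treated as a plain regex, 'decid*' is unanchored and also hits 'undecided'
pattern_raw <- 'decid*'
words[grepl(pattern_raw, words)]
# [1] "decided"   "decides"   "undecided"

# glob2rx() converts the wildcard into an anchored regex
pattern_glob <- glob2rx('decid*')
words[grepl(pattern_glob, words)]
# [1] "decided" "decides"

# Without a wildcard, the translated pattern is an exact match
sum(grepl(glob2rx('head'), words))
# [1] 1
```

Because glob2rx() anchors the pattern at the start of the string, this only helps when each element of words is a single word, as it is here.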

edited Oct 07 '21 at 10:42

answered Oct 07 '21 at 10:37

To me it looks like `head` should return a value of 1, since there is no wildcard at the end. The author's code returns 1. – Gowachin Oct 07 '21 at 10:38

Thanks for the comment. I think the OP wants a combination of pattern and exact match. I have updated the answer accordingly. – Ronak Shah Oct 07 '21 at 10:43

Chris Ruehlemann · Answer 2 · 2021-10-08T07:05:08.987

Perhaps as an alternative approach altogether: instead of creating a new dataframe word_freq, why not create a new column in df_main (if that's your "main" dataframe) which indicates the number of matches of your (apparently key) Items? Also, that column will not actually contain counts, because each row of the input column words contains only a single word. So the question is not how many matches there are for each row but whether there is a match in the first place. That can be indicated by grepl in base R or str_detect in stringr.

EDIT:

Given the newly posted input data

Items <-  c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)

and the OP's wish to have the matches in df_main, the solution might be this:

library(stringr)
df_main$Items_match <- +str_detect(df_main$words, str_c(Items, collapse = "|"))

Result:

df_main
      words Items_match
1      head           1
2     heads           1
3   decided           1
4   decides           1
5       top           0
6 undecided           1
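
For comparison, the same indicator column can be sketched in base R with grepl, the other option mentioned above. This is only an illustration under the same assumptions as the stringr version: the items are joined into a single alternation and used as an ordinary, unanchored regular expression, so undecided is flagged here as well. The name pattern is illustrative only.

```r
Items <- c('decid*', 'head', 'heads')
words <- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main <- data.frame(words)

# Join the items into one alternation, as in the stringr version
pattern <- paste(Items, collapse = "|")

# grepl() returns TRUE/FALSE per row; as.integer() turns that into 1/0
df_main$Items_match <- as.integer(grepl(pattern, df_main$words))
df_main
```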

Hi @Chris Ruehlemann Thanks for your help. I want the count of df1$Items in df_main. So, the results will show the df1$Items and their frequency in df_main. Can you please help with this? – Asghar Oct 08 '21 at 00:28

akrun · Answer 3 · 2021-10-08T03:12:01.020

0

Using tidyverse

library(dplyr)
library(stringr)
df1 %>% 
   rowwise %>%
   # turn '*' into '.*' and wrap each item in word boundaries before counting matches
   mutate(count = sum(str_detect(df_main$words,
     str_c("\\b", str_replace(Items, fixed("*"), ".*"), "\\b")))) %>%
   ungroup

-output

# A tibble: 3 × 2
  Items  count
  <chr>  <int>
1 decid*     2
2 head       1
3 heads      1
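
To make the pattern construction visible, here is a rough base R sketch of the same idea, not part of the original answer: each * is replaced by .* and the item is wrapped in word boundaries (\\b), so decid* cannot match in the middle of undecided and head cannot match heads. The names patterns and count are illustrative only, and perl = TRUE is an assumption made so that \\b is handled by the PCRE engine.

```r
Items <- c('decid*', 'head', 'heads')
words <- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')

# Turn each '*' into '.*' and wrap the item in word boundaries
patterns <- paste0("\\b", gsub("*", ".*", Items, fixed = TRUE), "\\b")
patterns

# Count how many words each pattern matches
count <- vapply(patterns, function(p) sum(grepl(p, words, perl = TRUE)),
                integer(1), USE.NAMES = FALSE)
data.frame(Items, count)
```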

edited Oct 08 '21 at 03:12

answered Oct 07 '21 at 17:34

Hi @akrun Thanks for your help. I want the count of df1$Items in df_main. So, the results will show the df1$Items and their frequency in df_main. It is important to get the correct results, as in the second table in the question. Could you please help with this? – Asghar Oct 08 '21 at 00:39
