I wrote some code to count how often words appear in a data frame:

Items <-  c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
item <- vector() 
count <- vector()
for (i in 1:length(unique(Items))){ 
item[i] <- Items[i] 
count[i]<- sum(df_main$words  == item[i])} 
word_freq <- data.frame(cbind(item, count))
word_freq

However, the results are like this:

item count
1 decid* 0
2 head 1
3 heads 1

As you can see, it does not count "decid*" correctly. The results I expect are:

item count
1 decid* 2
2 head 1
3 heads 1

I think I need to change the format of the item word (decid*), but I could not figure out how. Any help is much appreciated!

Asghar

3 Answers

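I think you want to use decid* as a wildcard pattern. == looks for an exact match; you can use grepl to look for a pattern instead. Note that as a regular expression decid* means "deci" followed by zero or more "d" anywhere in the string, so it would also match undecided. Converting the wildcard with glob2rx gives the anchored regex ^decid, which matches only words that start with decid.

I have used sapply as an alternative to the for loop.

result <- stack(sapply(unique(df1$Items), function(x) {
  # wildcard items: count regex matches; plain items: count exact matches
  if(grepl('*', x, fixed = TRUE)) sum(grepl(glob2rx(x), df_main$words))
  else sum(x == df_main$words)
}))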

result
# values    ind
#1      2 decid*
#2      1   head
#3      1  heads
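
To see why the form of the pattern matters, here is a small sketch comparing the raw string used as a regex with its glob2rx() translation. The variable names n_regex, rx and n_glob are only for illustration.

```r
words <- c("head", "heads", "decided", "decides", "top", "undecided")

# As a regular expression, "decid*" means "deci" followed by zero or more "d",
# matched anywhere in the string, so "undecided" is counted as well.
n_regex <- sum(grepl("decid*", words))
n_regex
# → 3

# glob2rx() (utils package, attached by default) translates the wildcard
# into an anchored regular expression.
rx <- glob2rx("decid*")
rx
# → "^decid"

n_glob <- sum(grepl(rx, words))
n_glob
# → 2
```

For an item without a wildcard, glob2rx("head") gives "^head$", which behaves like an exact match.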
Ronak Shah
  • To me it looks like `head` should return 1, since there is no wildcard at the end. The author's code returns 1. – Gowachin Oct 07 '21 at 10:38
  • Thanks for the comment. I think OP wants a combination of pattern and exact match. I have updated the answer accordingly. – Ronak Shah Oct 07 '21 at 10:43

Perhaps an alternative approach altogether: instead of creating a new dataframe word_freq, why not create a new column in df_main (if that's your "main" dataframe) that indicates whether each row matches any of your (apparently key) Items? That column will not actually contain counts, because each row of the input column words holds only a single word. So the question is not how many matches there are in each row but whether there is a match at all. That can be indicated with grepl in base R or str_detect in stringr.

EDIT:

Given the newly posted input data

Items <-  c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)

and the OP's wish to have the matches in df_main, the solution might be this:

library(stringr)
df_main$Items_match <- +str_detect(df_main$words, str_c(Items, collapse = "|"))

Result:

df_main
      words Items_match
1      head           1
2     heads           1
3   decided           1
4   decides           1
5       top           0
6 undecided           1
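
If both the per-word flag and the per-item counts are wanted, one possible base R sketch is to build a logical match matrix first. This assumes `*` is meant as a wildcard anchored at the start of the word (via glob2rx()), which is an interpretation rather than something the question states outright. The names match_matrix, item_counts and word_flag are made up for illustration.

```r
Items <- c("decid*", "head", "heads")
words <- c("head", "heads", "decided", "decides", "top", "undecided")

# One column per item, one row per word; TRUE where the word matches the item.
# glob2rx() turns "decid*" into "^decid" and "head" into "^head$".
match_matrix <- sapply(Items, function(p) grepl(glob2rx(p), words))
rownames(match_matrix) <- words

# Per-item counts are the column sums.
item_counts <- colSums(match_matrix)
item_counts
# decid*   head  heads
#      2      1      1

# Per-word flag: 1 if the word matches at least one item, else 0.
word_flag <- +(rowSums(match_matrix) > 0)
```

Under this anchored reading, undecided is not flagged, unlike in the result above, where the unanchored regular expression decid* matches inside the word.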
Chris Ruehlemann
  • Hi @Chris Ruehlemann Thanks for your help. I want the count of df1$Items in df_main. So, the results will show the df1$Items and their frequency in df_main. Can you please help with this? – Asghar Oct 08 '21 at 00:28
  • I've edited the solution to fit the new input data. – Chris Ruehlemann Oct 08 '21 at 06:52

Using tidyverse

library(dplyr)
library(stringr)
df1 %>% 
   rowwise %>%
   mutate(count =sum(str_detect(df_main$words,
     str_c("\\b", str_replace(Items, fixed("*"), ".*" ), "\\b")))) %>%
   ungroup

-output

# A tibble: 3 × 2
  Items  count
  <chr>  <int>
1 decid*     2
2 head       1
3 heads      1
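
For comparison, roughly the same pattern construction can be sketched in base R without dplyr or stringr. This is an approximation of the approach above, not a drop-in replacement, and it assumes `*` is the only special character in Items (other regex metacharacters would need escaping). The names patterns and word_freq are chosen for illustration.

```r
Items <- c("decid*", "head", "heads")
words <- c("head", "heads", "decided", "decides", "top", "undecided")

# Replace the literal "*" with ".*" and wrap each item in word boundaries,
# giving "\\bdecid.*\\b", "\\bhead\\b" and "\\bheads\\b".
patterns <- paste0("\\b", gsub("*", ".*", Items, fixed = TRUE), "\\b")

# Count, for each pattern, how many words it matches.
count <- vapply(patterns, function(p) sum(grepl(p, words, perl = TRUE)),
                integer(1), USE.NAMES = FALSE)

word_freq <- data.frame(Items, count)
word_freq
#    Items count
# 1 decid*     2
# 2   head     1
# 3  heads     1
```

The leading \\b is what keeps undecided from being counted, because there is no word boundary between "un" and "decided".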
akrun
  • Hi @akrun Thanks for your help. I want the count of df1$Items in df_main. So, the results will show the df1$Items and their frequency in df_main. It is important to get the correct results as the second table in the question. Could you please help with this? – Asghar Oct 08 '21 at 00:39