Getting the unique count of strings from a text string

Question

I am wondering how to get the number of unique matching words in a text string. Let's say I am looking for occurrences of the words apples, bananas, pineapples, and grapes in this string.

 A<- c('I have a lot of pineapples, apples and grapes. One day the pineapples person gave the apples person two baskets of grapes')

 df<- data.frame(A)

I want the number of distinct fruits from that list that appear in the text.

  library(stringr)
  df$fruituniquecount<- str_count(df$A, "apples|pineapples|grapes|bananas")

I tried this, but it returns the overall count, including repeats. I would like the answer to be 3. Please suggest your ideas.

I think you have to look at the `tidytext` package. Here is an online book: [link](https://www.tidytextmining.com/) – xhr489, Feb 25 '19 at 14:09

score 7 · Accepted Answer · answered Feb 25 '19 at 14:13


You could use str_extract_all and then calculate the length of the unique elements.

Input:

A <- c('I have a lot of pineapples, apples and grapes. One day the pineapples person gave the apples person two baskets of grapes')
fruits <- "apples|pineapples|grapes|bananas"

Result

length(unique(c(stringr::str_extract_all(A, fruits, simplify = TRUE))))
# [1] 3
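
When the same idea is applied to a column with several rows, `simplify = TRUE` returns one character matrix for the whole column (padded with empty strings), so `length(unique(c(...)))` yields a single number rather than one count per row. That may be what the comment thread below runs into. A minimal sketch of a row-wise variant, assuming a hypothetical two-row data frame `df2` that is not part of the original question:

```r
library(stringr)

# Hypothetical two-row example (not from the original question)
df2 <- data.frame(
  A = c("pineapples, apples and grapes; more apples",
        "only bananas and bananas"),
  stringsAsFactors = FALSE
)
fruits <- "apples|pineapples|grapes|bananas"

# Without simplify, str_extract_all() returns a list with one
# character vector of matches per row
matches <- str_extract_all(df2$A, fruits)

# Count the distinct matches within each row
df2$fruitcount <- lengths(lapply(matches, unique))
df2$fruitcount
```

Here `lengths()` gives one integer per row, so the result can be assigned directly to a new column.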

answered Feb 25 '19 at 14:13

markus


I am getting a strange error when I try this on my dataset and create a column called df$fruitcount. I have many rows and the count is always given as 5. Can you please suggest if I am missing anything? – user3570187 Feb 25 '19 at 14:23
Please share the output of `dput(head(your_dataframe))` at the end of your question. – markus Feb 25 '19 at 14:28
Yes I have added the data in the question and the expected output – user3570187 Feb 25 '19 at 14:36
@user3570187 This seems like a different story to me. As you received quite a few answers now I'd suggest you ask another one with the data that you just posted and accept / upvote the answers that solved this problem. – markus Feb 25 '19 at 14:39

Agree with @markus that your edits should be a different question. – tmfmnk Feb 25 '19 at 14:40
Thanks for the help! I posted another question. – user3570187 Feb 25 '19 at 14:47

Ben G · Answer 2 · 2019-03-04T03:15:29.760

3

Not exactly elegant, but you could use str_detect like this.

sum(str_detect(df$A, "apples"), 
    str_detect(df$A, "pineapples"), 
    str_detect(df$A, "grapes"), 
    str_detect(df$A, "bananas"))

Or, based on the comments below, if you put all these terms in their own vector you could then use an apply function:

fruits <- c("apples", "pineapples", "grapes", "bananas")
sum(sapply(fruits, function(x) str_detect(df$A, x)))
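
Two caveats apply to both versions above: the pattern "apples" also matches inside "pineapples", and `sum()` adds the hits across every row of `df$A`. A minimal sketch that guards against both, assuming a hypothetical two-element vector `text` (not from the original question) and using the regex word boundary `\\b`:

```r
library(stringr)

# Hypothetical two-row example (not from the original question)
text <- c("pineapples only", "apples and grapes and more grapes")
fruits <- c("apples", "pineapples", "grapes", "bananas")

# Wrap each term in word boundaries so "apples" does not match
# inside "pineapples"
patterns <- paste0("\\b", fruits, "\\b")

# One logical vector per fruit (TRUE where the fruit occurs in a row),
# added element-wise to get the number of distinct fruits per row
counts <- Reduce(`+`, lapply(patterns, function(p) str_detect(text, p)))
counts
```

Without the word boundaries, the first row would be counted as containing two fruits.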

edited Mar 04 '19 at 03:15

answered Feb 25 '19 at 14:13

Ben G


I am getting a strange error when I try this on my dataset and create a column called df$fruitcount. I have many rows and the count is always given as a very large number. Can you please suggest if I am missing anything? – user3570187 Feb 25 '19 at 14:24

This could be shortened to `sum(sapply(fruits, function(x) str_detect(df$A, x)))`, with `fruits <- c("apples", "pineapples", "grapes", "bananas")`. – LAP Feb 25 '19 at 14:25

tmfmnk · Answer 3 · 2021-03-08T07:55:24.240

3

One base possibility could be:

length(unique(unlist(regmatches(A, gregexpr("apples|pineapples|grapes|bananas", A, perl = TRUE)))))

[1] 3
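
If the fruit names live in a character vector, the pattern can be built from it, and `regmatches()` already returns one set of matches per input element. A minimal base R sketch, assuming a hypothetical two-element input `texts` that is not part of the original question:

```r
# Hypothetical two-element input (not from the original question)
texts <- c("pineapples, apples and grapes; more apples",
           "no fruit here")
fruits <- c("apples", "pineapples", "grapes", "bananas")

# Build "\\b(apples|pineapples|grapes|bananas)\\b" from the vector
pattern <- paste0("\\b(", paste(fruits, collapse = "|"), ")\\b")

# regmatches() returns a list with one vector of matches per element
found <- regmatches(texts, gregexpr(pattern, texts, perl = TRUE))

# Number of distinct matches per element
base_counts <- lengths(lapply(found, unique))
base_counts
```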

edited Mar 08 '21 at 07:55

answered Feb 25 '19 at 14:24

tmfmnk


score 2 · Answer 4 · answered Feb 25 '19 at 14:22

Perhaps a better way to do this is to first break the text into words and then count the matches among the unique words.

library(tokenizers)
library(magrittr)
df$fruituniquecount <- tokenize_words(A) %>% unlist(.) %>% unique(.) %>% 
       stringr::str_count(., "apples|pineapples|grapes|bananas") %>% sum(.)

arg0naut91 · Answer 5 · 2019-02-25T14:30:20.423

2

Could also do:

A <- c('I have a lot of pineapples, apples and grapes. One day the pineapples person gave the apples person two baskets of grapes')

df <- data.frame(A) 

fruits <- c("apples", "pineapples", "grapes", "bananas")

df$count <- sum(unique(tolower(unlist(strsplit(as.character(df$A), "\\.|,| ")))) %in% fruits)

Output:

[1] 3

edited Feb 25 '19 at 14:30

answered Feb 25 '19 at 14:24

arg0naut91


score 2 · Answer 6 · answered Feb 25 '19 at 14:24


Well, here is a regex-less base R solution as well:

sum(unique(strsplit(A, ' ')[[1]]) %in% c('apples', 'pineapples', 'grapes', 'bananas'))
#[1] 3
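
Note that splitting on spaces alone leaves punctuation attached ("pineapples," and "grapes."), so the line above returns 3 only because each fruit also appears once without trailing punctuation. A minimal sketch of a more robust variant, assuming a hypothetical input `txt` that is not from the original question (the clean-up step does use a regex):

```r
# Hypothetical input where every fruit is followed by punctuation
txt <- "I like apples, grapes, and pineapples."
fruits <- c("apples", "pineapples", "grapes", "bananas")

# Splitting on spaces alone keeps the trailing punctuation
naive <- sum(unique(strsplit(txt, " ", fixed = TRUE)[[1]]) %in% fruits)

# Remove punctuation before splitting
cleaned <- gsub("[[:punct:]]", "", txt)
robust <- sum(unique(strsplit(cleaned, " ", fixed = TRUE)[[1]]) %in% fruits)

c(naive = naive, robust = robust)
```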

answered Feb 25 '19 at 14:24

Sotos


NelsonGon · Answer 7 · 2019-02-25T14:32:30.827

2

We can use a combination of stringr and stringi:

target <- "apples|pineapples|grapes|bananas"  # inspired by @markus's solution
length(stringi::stri_unique(stringr::str_extract_all(A, target, simplify = TRUE)))
#[1] 3

edited Feb 25 '19 at 14:32

answered Feb 25 '19 at 14:26

NelsonGon


Ken Benoit · Answer 8 · 2019-03-03T05:27:12.553

Why reinvent the wheel? The quanteda package is built for this.

Define a vector of your fruits, which as a bonus I've used with the (default) glob pattern match type to catch both singular and plural forms.

A <- c("I have a lot of pineapples, apples and grapes. One day the pineapples person gave the apples person two baskets of grapes")
fruits <- c("apple*", "pineapple*", "grape*", "banana*")

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.2
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

Then once you have tokenized this into words using tokens(), you can send the result to tokens_select() using your vector fruits to select just those types.

toks <- tokens(A) %>%
  tokens_select(pattern = fruits)
toks
## tokens from 1 document.
## text1 :
## [1] "pineapples" "apples"     "grapes"     "pineapples" "apples"    
## [6] "grapes"

Finally, ntype() will tell you the number of word types (unique words), which is your desired output of 3.

ntype(toks)
## text1 
##     3

Alternatively you could have counted non-unique occurrences, known as tokens.

ntoken(toks)
## text1 
##     6

Both functions are vectorized to return a named integer vector where the element name will be your document name (here, the quanteda default of "text1" for the single document), so this also works easily and efficiently on a large corpus.
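
As a rough illustration of that vectorization, here is a minimal sketch on several documents at once. The three-element vector `docs` is a hypothetical input, not part of the original answer, and the calls are written without the pipe so the block does not depend on `%>%` being available:

```r
library("quanteda")

# Hypothetical three-document input (not from the original answer)
docs <- c("apples and grapes, but mostly apples",
          "no fruit here",
          "pineapples, pineapples, pineapples")
fruits <- c("apple*", "pineapple*", "grape*", "banana*")

# Tokenize every document, then keep only tokens matching the glob patterns
toks_docs <- tokens_select(tokens(docs), pattern = fruits)

ntype(toks_docs)   # distinct fruit words per document
ntoken(toks_docs)  # all fruit tokens per document
```

Each result has one element per document, so it can be attached to a data frame column of the same length.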

Advantages? Easier (and more readable) than regular expressions, plus you have access to additional functions for tokens. For instance, let's say you wanted to consider singular and plural fruit patterns as equivalent. You could do this in two ways in quanteda: by manually replacing the pattern with a canonical form using tokens_replace(), or by stemming the fruit names using tokens_wordstem().

Using tokens_replace():

B <- "one apple, two apples, one grape two grapes, three pineapples."

toksrepl <- tokens(B) %>%
  tokens_select(pattern = fruits) %>%
  tokens_replace(
    pattern = fruits,
    replacement = c("apple", "pineapple", "grape", "banana")
  )
toksrepl
## tokens from 1 document.
## text1 :
## [1] "apple"     "apple"     "grape"     "grape"     "pineapple"
ntype(toksrepl)
## text1 
##     3

Using tokens_wordstem():

toksstem <- tokens(B) %>%
  tokens_select(pattern = fruits) %>%
  tokens_wordstem()
toksstem
## tokens from 1 document.
## text1 :
## [1] "appl"     "appl"     "grape"    "grape"    "pineappl"
ntype(toksstem)
## text1 
##     3
