Why reinvent the wheel? The quanteda package is built for this.
Define a vector of your fruits. As a bonus, I've written them as glob patterns (quanteda's default pattern match type) so that they catch both singular and plural forms.
A <- c("I have a lot of pineapples, apples and grapes. One day the pineapples person gave the apples person two baskets of grapes")
fruits <- c("apple*", "pineapple*", "grape*", "banana*")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.2
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
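To see why the pattern apple* picks up apples but not pineapples, it can help to look at what a glob means as a regular expression. The sketch below uses base R's glob2rx() purely as an illustration; quanteda does its own pattern conversion internally, so treat this as an analogy rather than its actual implementation. The vector words is made up for the example.

```r
# Illustration only: base R's glob2rx() shows that a glob is anchored
# to the start of the string. quanteda's internal conversion may differ.
fruits <- c("apple*", "pineapple*", "grape*", "banana*")
fruit_regex <- utils::glob2rx(fruits)
fruit_regex

# Hypothetical tokens to test against
words <- c("apple", "apples", "pineapples", "grapes", "grapefruit")

# Which words does the first pattern ("apple*") match?
matches_apple <- grepl(fruit_regex[1], words, ignore.case = TRUE)
setNames(matches_apple, words)

# Which words does the third pattern ("grape*") match?
matches_grape <- grepl(fruit_regex[3], words, ignore.case = TRUE)
setNames(matches_grape, words)
```

One thing this makes visible: a trailing * matches any ending, not just a plural s, so grape* would also match a word such as grapefruit. If that matters for your data, spell out the exact forms you want instead.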
Then, once you have tokenized this into words using tokens(), you can send the result to tokens_select() with your vector fruits to select just those types.
toks <- tokens(A) %>%
  tokens_select(pattern = fruits)
toks
## tokens from 1 document.
## text1 :
## [1] "pineapples" "apples" "grapes" "pineapples" "apples"
## [6] "grapes"
Finally, ntype() will tell you the number of word types (unique words), which is your desired output of 3.
ntype(toks)
## text1
## 3
Alternatively, you could count every occurrence, repeats included. These are known as tokens.
ntoken(toks)
## text1
## 6
Both functions are vectorized to return a named integer vector where the element name will be your document name (here, the quanteda default of "text1" for the single document), so this also works easily and efficiently on a large corpus.
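As a rough sketch of the multi-document case, here is the same approach applied to a named character vector. The texts and the names d1 and d2 are invented for illustration, and I've used nested calls rather than %>% so the example does not depend on a pipe operator being available in your session.

```r
library("quanteda")

# Hypothetical two-document input; the element names become the document names
docs <- c(
  d1 = "apples and grapes, then more apples.",
  d2 = "a banana, a grape and a pineapple."
)
fruits <- c("apple*", "pineapple*", "grape*", "banana*")

# Tokenize every document, then keep only tokens matching the fruit patterns
toks_multi <- tokens_select(tokens(docs), pattern = fruits)

# One count per document, named by document
types_per_doc <- ntype(toks_multi)    # unique fruit words
tokens_per_doc <- ntoken(toks_multi)  # all fruit occurrences
types_per_doc
tokens_per_doc
```

Each result has one element per document, named d1 and d2, so you can bind the counts straight back onto whatever table holds your documents.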
Advantages? Easier (and more readable) than regular expressions, plus you have access to additional functions for tokens. For instance, let's say you wanted to consider singular and plural fruit patterns as equivalent. You could do this in two ways in quanteda: by manually replacing each pattern with a canonical form using tokens_replace(), or by stemming the fruit names using tokens_wordstem().
Using tokens_replace():
B <- "one apple, two apples, one grape two grapes, three pineapples."
toksrepl <- tokens(B) %>%
  tokens_select(pattern = fruits) %>%
  tokens_replace(
    pattern = fruits,
    replacement = c("apple", "pineapple", "grape", "banana")
  )
toksrepl
## tokens from 1 document.
## text1 :
## [1] "apple" "apple" "grape" "grape" "pineapple"
ntype(toksrepl)
## text1
## 3
Using tokens_wordstem():
toksstem <- tokens(B) %>%
  tokens_select(pattern = fruits) %>%
  tokens_wordstem()
toksstem
## tokens from 1 document.
## text1 :
## [1] "appl" "appl" "grape" "grape" "pineappl"
ntype(toksstem)
## text1
## 3
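For comparison, here is roughly what the regular-expression route mentioned above might look like in base R. This is only a sketch: the pattern fruit_regex is my own construction, it assumes plurals are formed with a plain s, and the fruit list has to be kept in sync with the pattern by hand.

```r
A <- "I have a lot of pineapples, apples and grapes. One day the pineapples person gave the apples person two baskets of grapes"

# Hand-written pattern: word boundary, a fruit name, optional plural "s", word boundary
fruit_regex <- "\\b(apple|pineapple|grape|banana)s?\\b"

# Extract every match from the single string
hits <- regmatches(A, gregexpr(fruit_regex, A, ignore.case = TRUE, perl = TRUE))[[1]]

n_tokens <- length(hits)                  # all occurrences
n_types <- length(unique(tolower(hits)))  # unique words
n_tokens
n_types
```

It gives the same counts for this sentence (6 occurrences, 3 unique words), but the escaping, the word boundaries, and the list handling from regmatches() are all things the quanteda version takes care of for you.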