unnest_tokens
cleans punctuation by default, which strips the hyphens out of your compound terms.
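To illustrate the problem, here is a minimal sketch (the one-row data.frame is just an assumed example): tidytext's unnest_tokens lowercases the text and removes punctuation while building n-grams, so the hyphenated term is split apart.

```r
library(tidytext)
library(dplyr)

df <- data.frame(text = "bovine retention-of-placenta", stringsAsFactors = FALSE)

# the hyphens are dropped, so "retention-of-placenta" becomes three separate words
df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# bigrams come out as "bovine retention", "retention of", "of placenta"
```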
You can use either quanteda or tm for this, as neither package removes punctuation by default. The examples below assume you have a data.frame and are working with a corpus, but quanteda's tokens
function can also work directly on a character vector.
example <- c("bovine retention-of-placenta sulpha-trimethoprim mineral-vitamin-liquid-mixture")
diagnostics <- data.frame(text = example, stringsAsFactors = FALSE)
with quanteda:
library(quanteda)
qcorp <- corpus(diagnostics)
bigrams <- tokens_ngrams(tokens(qcorp), n = 2, concatenator = " ")
qdfm <- dfm(bigrams)
convert(qdfm, "data.frame")
  document bovine retention-of-placenta retention-of-placenta sulpha-trimethoprim sulpha-trimethoprim mineral-vitamin-liquid-mixture
1    text1                            1                                         1                                                  1
Using just quanteda's tokens_ngrams on the example vector:
tokens_ngrams(tokens(example), n = 2, concatenator = " ")
tokens from 1 document.
text1 :
[1] "bovine retention-of-placenta" "retention-of-placenta sulpha-trimethoprim"
[3] "sulpha-trimethoprim mineral-vitamin-liquid-mixture"
Edit:
To get a character vector of your terms, you could use one of the other convert options and access $vocab.
convert(qdfm, "lda")$vocab
[1] "bovine retention-of-placenta" "retention-of-placenta sulpha-trimethoprim"
[3] "sulpha-trimethoprim mineral-vitamin-liquid-mixture"
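As a side note, if you only need the feature names and not a full conversion, quanteda's featnames accessor should return the same vector directly from the dfm:

```r
featnames(qdfm)
```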
Tidy data.frame:
tidytext has a tidy
function to transform data from various packages into a tidy form; both quanteda and tm are supported. So after getting the data into a dfm, you can use tidy to turn it into a tibble. After that, filter out the terms you are not interested in with the usual dplyr syntax.
tidy(qdfm)
# A tibble: 3 x 3
document term count
<chr> <chr> <dbl>
1 text1 bovine retention-of-placenta 1
2 text1 retention-of-placenta sulpha-trimethoprim 1
3 text1 sulpha-trimethoprim mineral-vitamin-liquid-mixture 1
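For example, to keep only the bigrams that contain a hyphenated compound term (the filter condition here is just illustrative, not part of the original answer):

```r
library(dplyr)

tidy(qdfm) %>%
  filter(grepl("-", term))   # keep rows whose term contains a hyphen
```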
end edit:
with tm:
library(tm)
# tokenizer that builds bigrams with NLP's ngrams() and words() (NLP is loaded by tm)
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
corp <- VCorpus(VectorSource(example))
dtm <- DocumentTermMatrix(corp, control=list(tokenize = NLPBigramTokenizer))
inspect(dtm)
<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity : 0%
Maximal term length: 50
Weighting : term frequency (tf)
Sample :
    Terms
Docs bovine retention-of-placenta retention-of-placenta sulpha-trimethoprim sulpha-trimethoprim mineral-vitamin-liquid-mixture
   1                            1                                         1                                                  1
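To get the terms as a character vector on the tm side as well, tm's Terms accessor works on the document-term matrix:

```r
Terms(dtm)
```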