
When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But, similar to Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, the meaning changes completely. I want to know how to segment documents into phrases rather than words (terms).
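For illustration, here is a toy example (the sentence and the quanteda calls are just one way to show the issue; any word-level tokeniser behaves the same):

library(quanteda)

txt <- "We use machine learning to measure semantic distance."

# a plain word-level document-feature matrix splits the phrases apart:
# "machine" and "learning" become separate features, as do
# "semantic" and "distance"
dfm(tokens(txt))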

  • Do you want to segment into pre-defined phrases, or just all n-length adjacent combinations (such as all bigrams)? – Ken Benoit Apr 18 '16 at 20:48
  • Yes, I want to segment documents into pre-defined phrases from our own dictionary. The dictionary contains "semantic distance", "machine learning", etc. – Fiona_Wang Apr 19 '16 at 05:21
  • The dictionary function in the quanteda package takes a list; I need to change 'semantic distance' into 'semantic_distance' as a list key, which maps to 'semantic distance'. – Fiona_Wang Apr 19 '16 at 07:45
  • That's a very different question than the above, I suggest you post a new question about how to match dictionaries whose values consist of white-space separated values. There are workarounds but quanteda's dictionary functions currently only work with single-token values. (Working on adding multiple token values however!) – Ken Benoit Apr 19 '16 at 11:53
  • Thanks, I have posted a new question: http://stackoverflow.com/questions/36732659/r-construct-document-term-matrix-how-to-match-dictionaries-whose-values-consist – Fiona_Wang Apr 20 '16 at 02:20

1 Answer


You can do this in R using the quanteda package, which can detect multi-word expressions as statistical collocations; these are probably the multi-word expressions you are referring to in English. To avoid collocations that contain stop words, first tokenise the text, then remove the stop words while leaving a "pad" in place; the pad prevents false adjacencies in the results (two words counted as adjacent that were in fact separated by a removed stop word).

require(quanteda)

pres_tokens <- 
    tokens(data_corpus_inaugural) %>%                                  # tokenise the corpus
    tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%   # drop punctuation, leaving pads
    tokens_remove(stopwords("english"), padding = TRUE)                # drop stopwords, leaving pads

pres_collocations <- textstat_collocations(pres_tokens, size = 2)

head(pres_collocations)
#          collocation count count_nested length   lambda        z
# 1      united states   157            0      2 7.893307 41.19459
# 2             let us    97            0      2 6.291128 36.15520
# 3    fellow citizens    78            0      2 7.963336 32.93813
# 4    american people    40            0      2 4.426552 23.45052
# 5          years ago    26            0      2 7.896626 23.26935
# 6 federal government    32            0      2 5.312702 21.80328

# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])

tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon"    "shall_endeavor" "high_sense"     "official_act"  

Using this "compounded" token set, we can now turn this into a document-feature matrix where the features consist of a mixture of original terms (those not found in a collocation) and the collocations. As can be seen below, "united" occurs alone and as part of the collocation "united_states".

pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
#                  features
# docs              united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
#   1789-Washington      4      2         0             0        0        0          0         0                   0             0
#   1793-Washington      1      0         0             0        0        0          0         0                   0             0
#   1797-Adams           3      9         0             0        0        0          0         0                   0             0
#   1801-Jefferson       0      0         0             0        0        0          0         0                   0             0
#   1805-Jefferson       1      4         0             0        0        0          0         0                   0             0

If you want a more brute-force approach, it's possible simply to create a document-by-bigram matrix this way:

# just form all bigrams
head(dfm(data_corpus_inaugural, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
##                  features
## docs              fellow-citizens_of of_the the_senate senate_and and_of the_house
##   1789-Washington                  1     20          1          1      2         2
##   1797-Adams                       0     29          0          0      2         0
##   1793-Washington                  0      4          0          0      1         0
##   1801-Jefferson                   0     28          0          0      3         0
##   1805-Jefferson                   0     17          0          0      1         0
##   1809-Madison                     0     20          0          0      2         0
Ken Benoit
  • The 'dfm' function couldn't convert features into the equivalence classes defined by the phrases of a dictionary object. – Fiona_Wang Apr 19 '16 at 09:15
  • Currently the dictionary functions (e.g. `applyDictionary()`, which is also called by `dfm(x, dictionary = yourDictionary)`) do not work with dictionaries whose values consist of white-space separated tokens. – Ken Benoit Apr 19 '16 at 11:51
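A possible workaround, sketched below (the phrase list and dictionary are invented for illustration, and this assumes a quanteda version that provides `phrase()` and `dfm_lookup()`): compound each multi-word phrase into a single underscore-joined token first, then look it up with a dictionary keyed on the joined form.

library(quanteda)

# hypothetical pre-defined phrase list and a dictionary keyed on the joined form
my_phrases <- c("semantic distance", "machine learning")
my_dict <- dictionary(list(semantic_distance = "semantic_distance",
                           machine_learning  = "machine_learning"))

toks <- tokens("We use machine learning to measure semantic distance.")

# join each multi-word phrase into one token, e.g. "machine_learning"
toks <- tokens_compound(toks, phrase(my_phrases))

# the single-token dictionary values now match the compounded tokens
dfm_lookup(dfm(toks), my_dict)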