Filter a TermDocumentMatrix with a dictionary of regular expressions

Question

I feel like this should be fairly easy. I have a dictionary of terms, currently written as globs, which I have converted to regular expressions because I think the tm package works only with those. That's fine. But I cannot figure out how to subset a TermDocumentMatrix by passing multiple dictionary terms. The other twist is that the dictionary terms vary in length: some are 1 word long, some 2, some 3.

The following is my current code.

# Load libraries
library(tm)
library(stringi)

# Load the crude corpus, which ships with the tm package
data(crude)

# Tokenizer that emits 1- to 3-word n-grams, to cover multi-word dictionary terms
myTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = " "),
         use.names = FALSE)
}

# Make the TermDocumentMatrix
tdm <- TermDocumentMatrix(crude, control = list(tokenizer = myTokenizer))

# Make the dictionary of regular expressions
dict <- c('^also$', '^told reuters$', '^an emergency$', '^in world oil$')

# This is what I am working with
inspect(
  tdm[sapply(dict, function(x) stri_detect_regex(tdm$dimnames$Terms, pattern = x)), ]
)

Putting dollar signs at the end of a regex pattern implies that there will only be a match if the words preceding it are at the very end of a character value. The additional caret implies that only an exact match will be recognized. You may want to remove both those markers. — IRTFM, Jun 16 '16 at 15:18
OK, I can play around with that, but what about the core question of supplying multiple regular expressions to filter a TermDocumentMatrix — spindoctor, Jun 16 '16 at 15:24

score 1 · Answer 1 · answered Jun 16 '16 at 16:07

I now find that the crude dataset ships with one of the loaded packages, so the example can be tested. Removing the carets and dollar signs from the patterns lets a much larger number of terms match the targets:

> sum( sapply(dict, grepl, x=tdm$dimnames$Terms))
[1] 4
> dict2<-c('also', 'told reuters', 'an emergency', 'in world oil')
> sum( sapply(dict2, grepl, x=tdm$dimnames$Terms))
[1] 51
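
The effect of the anchors can be reproduced without the corpus. The sketch below uses a small made-up character vector as a stand-in for tdm$dimnames$Terms (the values are hypothetical, not the real crude terms) and counts matches with and without ^ and $:

```r
# Toy stand-in for tdm$dimnames$Terms (hypothetical values)
terms <- c("also", "also called", "he also",
           "told reuters", "he told reuters", "an emergency")

anchored   <- c("^also$", "^told reuters$")
unanchored <- c("also", "told reuters")

# sapply() returns a logical matrix: one row per term, one column per pattern
m_anchored   <- sapply(anchored,   grepl, x = terms)
m_unanchored <- sapply(unanchored, grepl, x = terms)

sum(m_anchored)    # 2: only the exact terms "also" and "told reuters"
sum(m_unanchored)  # 5: every term that contains one of the phrases
```

Whether the anchors should stay depends on the goal: keep them to select exact dictionary entries only, drop them to select any n-gram containing an entry.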

You can see which terms match if you use grep (the results from grepl would be 4 times as long as tdm$dimnames$Terms):

> sapply(dict2, grep, x=tdm$dimnames$Terms)
$also
 [1]  707  708  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752  753
[19]  754 1485 1486 2434 2881 2882 2988 2989 3399 3400 3782 3983 5265 5995 6088 6382 6383 6893
[37] 7427 7428 7524 7525 7605

$`told reuters`
[1] 3013 7209 7210

$`an emergency`
[1]  779  780  781 2437 2642 4205

$`in world oil`
[1] 3276
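
To see the matching term strings rather than their positions, grep can return the values directly with value = TRUE. A minimal sketch, again on a made-up vector standing in for tdm$dimnames$Terms:

```r
# Toy stand-in for tdm$dimnames$Terms (hypothetical values)
terms <- c("also", "also called", "he also",
           "told reuters", "he told reuters")
dict2 <- c("also", "told reuters")

# Name the patterns so the result list is labelled by pattern;
# lapply() always returns a list, even if all patterns match equally often
hits <- lapply(setNames(dict2, dict2), grep, x = terms, value = TRUE)
hits
```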

The print method for TDMs is not particularly informative, but you can "explode" the value with dput to see what is inside:

> dput(tdm[ sapply(dict2, grepl, x=tdm$dimnames$Terms), ] )
structure(list(i = c(1L, 2L, 3L, 8L, 9L, 33L, 3L, 16L, 17L, 20L, 
21L, 32L, 3L, 6L, 7L, 22L, 39L, 40L, 3L, 14L, 15L, 36L, 37L, 
38L, 3L, 12L, 13L, 27L, 28L, 41L, 3L, 10L, 11L, 25L, 26L, 30L, 
3L, 4L, 5L, 23L, 24L, 31L, 3L, 4L, 5L, 23L, 24L, 31L, 3L, 18L, 
19L, 29L, 34L, 35L), j = c(6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 
7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L, 14L, 14L, 14L, 14L, 14L, 14L, 16L, 16L, 
16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 
18L, 18L, 18L), v = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    nrow = 41L, ncol = 20L, dimnames = structure(list(Terms = c("ali also", 
    "ali also delivered", "also", "also called", "also called for", 
    "also contributed", "also contributed to", "also delivered", 
    "also delivered \"a", "also denied", "also denied that", 
    "also nigerian", "also nigerian oil", "also no", "also no projection", 
    "also reviews", "also reviews the", "also was", "also was lowered", 
    "but also", "but also reviews", "european weekend also", 
    "group, also", "group, also called", "he also", "he also denied", 
    "is also", "is also nigerian", "louisiana sweet also", "meeting.\" he also", 
    "private group, also", "sector, but also", "sheikh ali also", 
    "sweet also", "sweet also was", "there was also", "was also", 
    "was also no", "weekend also", "weekend also contributed", 
    "who is also"), Docs = c("127", "144", "191", "194", "211", 
    "236", "237", "242", "246", "248", "273", "349", "352", "353", 
    "368", "489", "502", "543", "704", "708")), .Names = c("Terms", 
    "Docs"))), .Names = c("i", "j", "v", "nrow", "ncol", "dimnames"
), class = c("TermDocumentMatrix", "simple_triplet_matrix"), weighting = c("term frequency", 
"tf"))
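
Every term in the dput output above contains "also", so the rows matched by the other three patterns are not part of that subset. One plausible reason (an assumption, not checked against the matrix indexing code) is that sapply returns a logical matrix with one column per pattern rather than a single row index. A sketch of two ways to collapse several patterns into one logical vector with one element per term, shown on a made-up vector standing in for tdm$dimnames$Terms:

```r
# Toy stand-in for tdm$dimnames$Terms (hypothetical values)
terms <- c("also", "also called", "he also",
           "told reuters", "oil prices", "an emergency")
dict <- c("^also$", "^told reuters$", "^an emergency$", "^in world oil$")

# Option 1: one grepl() per pattern, then OR the logical vectors together
keep_or <- Reduce(`|`, lapply(dict, grepl, x = terms))

# Option 2: join the patterns into a single alternation
keep_alt <- grepl(paste(dict, collapse = "|"), terms)

identical(keep_or, keep_alt)
terms[keep_or]

# With the real objects this would become (assumes tm is loaded and tdm exists):
# tdm[keep_or, ]
```

Both options give a vector the same length as the term list, which is the shape a row index needs. The alternation is shorter, but the Reduce form keeps the patterns separate if per-pattern results are also wanted.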
