I am trying to create a dfm of letters from strings. I am running into problems because the dfm does not pick up, or create features for, punctuation characters such as "/", "-", "." or "'".

library(quanteda)

# One dictionary key per character, each mapping to itself
dict <- c('a', 'b', 'c', 'd', 'e', 'f', '/', ".", '-', "'")
dict <- quanteda::dictionary(sapply(dict, list))

# Split each string into single characters, then re-join them with spaces
x <- c("cab", "baa", "a/de-d/f", "ad")
x <- sapply(x, function(s) strsplit(s, "")[[1]])
x <- sapply(x, function(s) paste(s, collapse = " "))

mat <- dfm(x, dictionary = dict, valuetype = "regex")
mat <- as.matrix(mat)
mat
  1. For "a/de-d/f", I want to capture the characters "/" and "-" too.
  2. Why is the "." feature acting as a rowsum? How can I keep it as an individual feature?
SuperSatya
  • Like `tokens <- tokenize(x, what = "character"); mat <- dfm(tokens, dictionary = dict, valuetype = "fixed")`? In a regular expression ("regex"), `.` stands for any character. – lukeA Nov 20 '16 at 02:32
  • Thanks. This is exactly what I was looking for. – SuperSatya Nov 20 '16 at 05:47

1 Answer

The problem (as @lukeA points out in a comment) is that your valuetype selects the wrong kind of pattern matching. With valuetype = "regex", the dictionary value . is treated as a regular expression, where . stands for any character. It therefore matches every token, and the . column ends up holding a total for each document (what you call a rowsum).
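The difference between the two kinds of matching can be seen in base R alone. The following is only an illustrative sketch: it uses grepl() rather than quanteda's own matching code, and the three tokens are the ones from the "cab" document.

```r
# Tokens of the "cab" document after splitting on whitespace
tokens <- c("c", "a", "b")

# As a regular expression, "." matches any single character,
# so every token counts as a hit.
regex_hits <- sum(grepl(".", tokens))

# As a fixed (literal) pattern, "." only matches an actual full stop.
fixed_hits <- sum(grepl(".", tokens, fixed = TRUE))

# Escaping the dot gives the literal meaning inside a regex as well.
escaped_hits <- sum(grepl("\\.", tokens))

regex_hits    # 3
fixed_hits    # 0
escaped_hits  # 0
```

The regex count of 3 is the same figure that appears in the . column for "cab" in the "regex" output.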

Let's first look at x, which will be tokenised on the whitespace by dfm(), so that each character becomes a token.

x
#        cab               baa          a/de-d/f                ad 
#    "c a b"           "b a a" "a / d e - d / f"             "a d" 

To answer (2) first, you are getting the following with a "regex" match:

dfm(x, dictionary = dict, valuetype = "regex", verbose = FALSE)
## Document-feature matrix of: 4 documents, 10 features.
## 4 x 10 sparse Matrix of class "dfmSparse"
##           features
## docs       a b c d e f / . - '
##   cab      1 1 1 0 0 0 0 3 0 0
##   baa      2 1 0 0 0 0 0 3 0 0
##   a/de-d/f 1 0 0 2 1 1 0 5 0 0
##   ad       1 0 0 1 0 0 0 2 0 0

Switching to valuetype = "fixed" makes . match only a literal full stop, which resolves (2), but it does not answer (1). To solve that, you also need to alter the default tokenisation behaviour of dfm() so that it does not remove punctuation:

dfm(x, dictionary = dict, valuetype = "fixed", removePunct = FALSE, verbose = FALSE)
## Document-feature matrix of: 4 documents, 10 features.
## 4 x 10 sparse Matrix of class "dfmSparse"
##           features
## docs       a b c d e f / . - '
##   cab      1 1 1 0 0 0 0 0 0 0
##   baa      2 1 0 0 0 0 0 0 0 0
##   a/de-d/f 1 0 0 2 1 1 2 0 1 0
##   ad       1 0 0 1 0 0 0 0 0 0

Now the / and - are being counted. The . and ' remain present as features because they are dictionary keys, but they have a zero count for every document because neither character occurs in any of the strings.
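As a cross-check that does not depend on quanteda at all, the same character counts can be computed with base R. This is a minimal sketch, not part of the approach above: count_chars is a hypothetical helper name, and x and dict are redefined here as plain character vectors.

```r
# The original strings and the characters to count (plain vectors here)
x <- c("cab", "baa", "a/de-d/f", "ad")
dict <- c("a", "b", "c", "d", "e", "f", "/", ".", "-", "'")

# Hypothetical helper: split a string into single characters and count
# how often each dictionary character occurs (characters not in dict
# are ignored).
count_chars <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  tabulate(match(chars, dict), nbins = length(dict))
}

# One row per string, one column per dictionary character
counts <- t(vapply(x, count_chars, integer(length(dict))))
colnames(counts) <- dict
counts
```

For these four strings the result should agree with the "fixed" document-feature matrix: 2 for / and 1 for - in "a/de-d/f", and zeros throughout the . and ' columns.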

Ken Benoit
  • Thanks. I already got it fixed with just the `valuetype = "fixed"` argument and without `removePunct`. I guess it doesn't matter, as it was catching all punctuation anyway. – SuperSatya Nov 21 '16 at 05:31