
I am quite new to R, so sorry if my question is trivial. I am trying to work with word clouds. The function comparison.cloud is supposed to accept a term-document matrix of word frequencies, built like this:

head(term.matrix,1)
      Docs
Terms SOTU 2010 SOTU 2011
  ’ll         3         8

colnames(term.matrix)
[1] "SOTU 2010" "SOTU 2011"

I am trying to build such a matrix myself, but I am confused about why "Terms" is not treated as a column name and why "Docs" appears above the two column names "SOTU 2010" and "SOTU 2011".

Can someone please explain this to me?

s__
jback

1 Answer


The dimnames attribute of a matrix, if not NULL, is a list of the form list(rownames, colnames) storing the row names and column names of the matrix.

x <- matrix(1:9, 3L, 3L)
x
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

dimnames(x) <- list(letters[1:3], LETTERS[1:3])
x
##   A B C
## a 1 4 7
## b 2 5 8
## c 3 6 9

Sometimes, it is convenient for the list itself to have names. These names act somewhat like axis titles:

names(dimnames(x)) <- c("lo", "UP")
x
##    UP
## lo  A B C
##   a 1 4 7
##   b 2 5 8
##   c 3 6 9

lo is printed on the same line as the column names, but it is really the title of the first dimension. Similarly, UP is the title of the second dimension.
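As a small sketch of the same idea, the titles can also be supplied when the matrix is created, by naming the elements of the dimnames list. The object name m is made up for this illustration. Note that rownames() and colnames() return only the names proper, never the titles:

```r
# Name the elements of the dimnames list to give each dimension a title
m <- matrix(1:6, nrow = 2L,
            dimnames = list(lo = c("a", "b"), UP = c("A", "B", "C")))

names(dimnames(m))  # titles of the two dimensions
rownames(m)         # row names only; "lo" is not among them
colnames(m)         # column names only; "UP" is not among them
```

This is why colnames(term.matrix) in the question lists only the two document names.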

TermDocumentMatrix and DocumentTermMatrix objects are not true R matrices. They store nonzero elements in triplet format for efficiency, as well as some metadata. However, like true R matrices, they can have a dimnames attribute. Since the rows and columns represent terms and documents (or vice versa), package tm assigns names Terms and Docs to the dimnames.

Taking an example from vignette("tm"):

library("tm")
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578, mode = "binary"), 
                   readerControl = list(reader = readReut21578XMLasPlain))
tdm <- TermDocumentMatrix(reuters)
str(tdm)
## List of 6
##  $ i       : int [1:2255] 14 35 49 157 202 203 233 274 290 291 ...
##  $ j       : int [1:2255] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:2255] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 1266
##  $ ncol    : int 20
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:1266] "..." "\"(it)" "\"demand" "\"expansion" ...
##   ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

Hence:

y <- as.matrix(tdm)[1:6, 1:6]
y
##             Docs
## Terms        127 144 191 194 211 236
##   ...          0   0   0   0   0   0
##   "(it)        0   0   0   0   0   0
##   "demand      0   1   0   0   0   0
##   "expansion   0   0   0   0   0   0
##   "for         0   0   0   0   0   0
##   "growth      0   0   0   0   0   0

dimnames(y)
## $Terms
## [1] "..."         "\"(it)"      "\"demand"    "\"expansion" "\"for"       "\"growth"   
## 
## $Docs
## [1] "127" "144" "191" "194" "211" "236"
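To build a matrix with the same layout by hand, without going through tm, something like the following should work. This is only a sketch: the counts for ’ll come from the question, while the "america" row and its counts are invented purely for illustration.

```r
# Rows are terms, columns are documents.
# "\u2019ll" is the term ’ll from the question (typographic apostrophe);
# the "america" row is hypothetical and only pads out the example.
term.matrix <- matrix(c( 3,  8,
                        10, 12),
                      nrow = 2L, byrow = TRUE,
                      dimnames = list(Terms = c("\u2019ll", "america"),
                                      Docs  = c("SOTU 2010", "SOTU 2011")))

term.matrix            # prints with Terms and Docs as dimension titles
colnames(term.matrix)  # still just the two document names
```

A matrix built this way is an ordinary numeric matrix with named dimnames, which is the same kind of object that as.matrix(tdm) returns above.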
Mikael Jagan