
I have a document-term matrix in CLUTO format:

#Document #Term #TotalItem
term-x weight-x term-y weight-y (nonzero terms only, one row per document)

Instead of building it from a corpus, I want to create a DocumentTermMatrix (tm package) directly from this file. Is this possible?

Cluto File:
2 3 3
1 3 3 4
2 8

Row File:
car
plane

Column File:
x
y
z
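
To make the layout concrete, here is a minimal base R sketch that expands the sample above into a dense matrix. It is only an illustration of the format, not what tm or slam do internally; the variable names (`cluto_lines`, `dense`) are made up for this example, and it assumes fields are separated by single or multiple spaces.

```r
# Sample data copied from the question; all variable names are illustrative.
cluto_lines <- c("2 3 3", "1 3 3 4", "2 8")
rows <- c("car", "plane")
columns <- c("x", "y", "z")

# Header line: number of documents, number of terms, number of nonzero entries.
header <- as.integer(strsplit(cluto_lines[1], " +")[[1]])
dense <- matrix(0, nrow = header[1], ncol = header[2],
                dimnames = list(rows, columns))

# Each following line holds (term index, weight) pairs for one document.
for (i in seq_len(header[1])) {
  fields <- as.numeric(strsplit(trimws(cluto_lines[i + 1]), " +")[[1]])
  if (length(fields) == 0) next  # a document with no nonzero terms
  term_idx <- fields[seq(1, length(fields), by = 2)]
  weights <- fields[seq(2, length(fields), by = 2)]
  dense[i, term_idx] <- weights
}

print(dense)
```

For this sample, "car" ends up with weight 3 for x and 4 for z, and "plane" with weight 8 for y, which matches the three nonzero entries announced in the header.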

Solution:

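library(slam)  # read_stm_CLUTO()
library(tm)    # as.DocumentTermMatrix(), weightTf

# `file` is the path to the CLUTO matrix file
dtm <- as.DocumentTermMatrix(read_stm_CLUTO(file), weightTf)
rows <- scan("rows.txt", what = "", sep = "\n")
columns <- scan("columns.txt", what = "", sep = "\n")

dtm$dimnames <- list(rows, columns)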
metdos
  • How about this? `require(slam); as.DocumentTermMatrix(read_stm_CLUTO(file), weightTf)` – Ben Apr 02 '13 at 16:24
  • @Ben Perfect. Could you post it as an answer so I can accept it? Is there any way to pass row and column names? – metdos Apr 02 '13 at 17:08

1 Answer

This should do it:

require(slam)
as.DocumentTermMatrix(read_stm_CLUTO(file), weightTf)

If you can link to your CLUTO file or add an excerpt of it to your question, we can look at row and column names.

hat-tip: https://r-forge.r-project.org/scm/viewvc.php/pkg/R/foreign.R?root=tm&view=diff&r1=1127&r2=1127&diff_format=s

Ben
  • Looks like you've got col/row names sorted. You might do `dtm$dimnames = list(Docs = rows, Terms = columns)` – Ben Apr 02 '13 at 23:00
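
Since the label files are read independently of the matrix, it may be worth checking that their lengths agree with the CLUTO header before assigning them. The sketch below uses base R only; `cluto_dimnames` is a hypothetical helper written for this example (it is not part of tm or slam), and the temporary files simply reproduce the sample data from the question.

```r
# Hypothetical helper, not part of tm or slam: builds a named dimnames list
# after checking the label counts against the CLUTO header line.
cluto_dimnames <- function(cluto_file, rows_file, columns_file) {
  # First line of the CLUTO file: #documents, #terms, #nonzero entries.
  header <- scan(cluto_file, what = integer(), nlines = 1, quiet = TRUE)
  rows <- readLines(rows_file)
  columns <- readLines(columns_file)
  stopifnot(length(rows) == header[1], length(columns) == header[2])
  list(Docs = rows, Terms = columns)
}

# Sample data from the question, written to temporary files.
cluto_file <- tempfile()
writeLines(c("2 3 3", "1 3 3 4", "2 8"), cluto_file)
rows_file <- tempfile()
writeLines(c("car", "plane"), rows_file)
columns_file <- tempfile()
writeLines(c("x", "y", "z"), columns_file)

dn <- cluto_dimnames(cluto_file, rows_file, columns_file)
str(dn)
```

The resulting list has the same `Docs`/`Terms` shape as in the comment above, so with tm and slam loaded it could be assigned with `dtm$dimnames <- dn`.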