Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I       like    hate    databases
D1      1       1       0       1
D2      1       0       2       1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
0
votes
2 answers

R: Finding frequency per term -- Warning Message

I am attempting to find the frequency per term in Martin Luther King's "I Have a Dream" speech. I have converted all uppercase letters to lowercase and I have removed all stop words. I have the text in a .txt file so I cannot display it on here.…
mapleleaf
  • 758
  • 3
  • 8
  • 14
0
votes
1 answer

How to store Sparsity and Maximum term length of a Term document matrix from tm

How can I store the sparsity and maximum term length of a Term Document Matrix in separate variables in R while finding n-grams? library(tm) library(RWeka) #stdout <- vector('character') #con <- textConnection('stdout','wr',local = TRUE) #reading the…
PradhanKamal
  • 540
  • 4
  • 18
0
votes
1 answer

Why are stopwords not filtered out in `tm` corporized term-document matrices?

I'm building a term-document matrix using the tm library. # Create corpus. corporize <- function(dir_to_corporize) { crp <- Corpus(DirSource(dir_to_corporize, mode="text", encoding="ASCII"), readerControl=list(reader=readPlain,…
TMOTTM
  • 3,286
  • 6
  • 32
  • 63
0
votes
2 answers

How to split out numeric vector of bigrams from TDM matrix

I have a Large numeric (46201 elements, 3.3 Mb) in R. tdm_pairs.matrix <- as.matrix(tdm_pairs) top_pairs <- colSums(tdm_pairs.matrix) head(sort(top_pairs, decreasing = T),2) i know i dont i think i can i just i want 46 42 41…
jKraut
  • 2,325
  • 6
  • 35
  • 48
0
votes
2 answers

Creating TermDocumentMatrix: issue with number of documents

I'm attempting to create a term document matrix with a text file that is about 3+ million lines of text. I have created a random sample of the text, which results in about 300,000 lines. Unfortunately, when I use the following code I end up with…
statsguyz
  • 419
  • 2
  • 11
  • 35
0
votes
3 answers

How to create wordclouds for text files in a directory in R

I am trying to create a wordcloud for each text file in a directory. They are four presidential announcement speeches. I keep getting the following message: > cname <- file.path("C:", "texts") > cname [1] "C:/texts" > cname <-…
0
votes
0 answers

DocumentTermMatrix function not considering all the terms of the corpus in R

I am new to the tm package in R. I am running the following code on a corpus, but the output of DocumentTermMatrix is not considering all the terms. corpus = Corpus(VectorSource(text)) corpus = tm_map(corpus, PlainTextDocument) dtm =…
Lal
  • 23
  • 5
0
votes
1 answer

How to build a termdocumentmatrix in R

I was wondering if it's possible to build a TermDocumentMatrix without using the package tm. I was thinking about using two for loops in combination with a grep, but unfortunately I did not manage to create something useful. matrix <- matrix(,…
Olivier Thierie
  • 161
  • 2
  • 11
0
votes
1 answer

My DocumentTermMatrix reduces to Zero columns

train <- read.delim('train.tsv', header= T, fileEncoding= "windows-1252",stringsAsFactors=F) Train.tsv contains 1,56,060 lines of text with 4 column names Phrase, PhraseID, SentenceID and Sentiment(on scale of 0 to 4).Phrase column has the text…
0
votes
1 answer

Given TermDocumentMatrix, How can I convert it to numeric matrix?

I have already generated a TermDocumentMatrix as shown below: > tmm [[1]] <> Non-/sparse entries: 11956/75992 Sparsity : 86% Maximal term length: 25 Weighting : term frequency…
user3746295
  • 101
  • 2
  • 11
0
votes
1 answer

Unused Argument error in R using tm for word frequency matrix?

I'm new to programming and R. I'm trying to use the wordfish function in the Austin package. I created a term document matrix from a corpus but cannot successfully use the wordfish command: library(tm) library(austin) …
0
votes
1 answer

Create Term Document Matrix of Bi Grams?

I am doing text mining on a large data set. I was able to create TDM and DTM and was able to perform my analysis using TDF & IDF. But can we create a Term Document Matrix or Document Term Matrix for bigrams in R? I know a similar facility is available…
Tanveer
  • 890
  • 12
  • 22
0
votes
1 answer

Including all tokens in the term-document matrix in the R tm package

I'm trying to make a term-document matrix with the TermDocumentMatrix function of the tm package in R and found that some words are not included. > library(tm) > tdm <- TermDocumentMatrix(Corpus(VectorSource("The book is of great importance."))) >…
Akira Murakami
  • 463
  • 1
  • 4
  • 14
0
votes
2 answers

Generate Term-Document matrix using Lucene 4.4

I'm trying to create a term-document matrix for a small corpus to further experiment with LSI. However, I couldn't find a way to do it with Lucene 4.4. I know how to get the TermVector for each document as follows: //create boolean query to search for a…
chepukha
  • 2,371
  • 3
  • 28
  • 40
0
votes
2 answers

Read a term-document matrix from csv using python

The reason the classic csv reader doesn't work on term-document arrays is that the first column of the csv file contains terms, not values. Thus the file has the following syntax: "";"label1";"label2";"label3"…
stelios
  • 2,679
  • 5
  • 31
  • 41