Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are widely used in natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, given the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
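The example above can be reproduced with a minimal sketch in plain Python. The whitespace tokenizer, the fixed column order, and the particular tf-idf variant (raw count times log(N / df)) are assumptions made for illustration; libraries such as tm or scikit-learn use their own defaults, which may differ.

```python
import math
from collections import Counter

# The two example documents from the text above.
docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Column order fixed by hand to match the example matrix (an assumption;
# real tools usually sort the vocabulary or order it by first appearance).
vocab = ["I", "like", "hate", "databases"]

# Whitespace tokenization only: no lowercasing, stemming, or stop-word removal.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Document-term matrix: one row per document, one column per term.
dtm = [[counts[name][term] for term in vocab] for name in docs]

# Term-document matrix: the transpose (rows are terms, columns are documents).
tdm = [list(column) for column in zip(*dtm)]

# One common tf-idf variant: tf = raw count, idf = log(N / df), where df is
# the number of documents containing the term. Terms that occur in every
# document get weight 0 under this variant.
n_docs = len(docs)
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]
tfidf = [
    [row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
    for row in dtm
]

for name, row in zip(docs, dtm):
    print(name, row)
```

With this weighting, "I" and "databases" drop to zero because they occur in both documents, while "like" and "hate" keep a positive weight.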

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions

votes

2 answers

R: Finding frequency per term -- Warning Message

I am attempting to find the frequency per term in Martin Luther King's "I Have a Dream" speech. I have converted all uppercase letters to lowercase and I have removed all stop words. I have the text in a .txt file so I cannot display it here.…

r frequency tm corpus term-document-matrix

asked Oct 19 '15 at 19:40

mapleleaf

votes

1 answer

How to store Sparsity and Maximum term length of a Term document matrix from tm

How can I store the sparsity and maximum term length of a term-document matrix in separate variables in R while finding n-grams? library(tm) library(RWeka) #stdout <- vector('character') #con <- textConnection('stdout','wr',local = TRUE) #reading the…

r nlp tm term-document-matrix

asked Oct 07 '15 at 16:06

PradhanKamal

votes

1 answer

Why are stopwords not filtered out in `tm` corporized term-document matrices?

I'm building a term-document matrix using the tm library. # Create corpus. corporize <- function(dir_to_corporize) { crp <- Corpus(DirSource(dir_to_corporize, mode="text", encoding="ASCII"), readerControl=list(reader=readPlain,…

r tm term-document-matrix

asked Aug 19 '15 at 20:45

TMOTTM

votes

2 answers

How to split out numeric vector of bigrams from TDM matrix

I have a Large numeric (46201 elements, 3.3 Mb) in R. tdm_pairs.matrix <- as.matrix(tdm_pairs) top_pairs <- colSums(tdm_pairs.matrix) head(sort(top_pairs, decreasing = T),2) i know i dont i think i can i just i want 46 42 41…

r vector n-gram term-document-matrix

asked Jul 26 '15 at 00:46

jKraut

votes

2 answers

Creating TermDocumentMatrix: issue with number of documents

I'm attempting to create a term-document matrix from a text file that is about 3+ million lines of text. I have created a random sample of the text, which results in about 300,000 lines. Unfortunately, when I use the following code I end up with…

r statistics nlp tm term-document-matrix

asked Jul 15 '15 at 08:38

statsguyz

votes

3 answers

How to create wordclouds for text files in a directory in R

I am trying to create a wordcloud for each text file in a directory. They are four presidential announcement speeches. I keep getting the following message: > cname <- file.path("C:", "texts") > cname [1] "C:/texts" > cname <-…

r text-mining word-cloud term-document-matrix quanteda

asked May 11 '15 at 05:28

Bernice Sturdivant

votes

0 answers

DocumentTermMatrix function not considering all the terms of the corpus in R

I am new to the tm package in R. I am running the following code on a corpus, but the output of DocumentTermMatrix does not include all the terms. corpus = Corpus(VectorSource(text)) corpus = tm_map(corpus, PlainTextDocument) dtm =…

r text-processing tm corpus term-document-matrix

asked Apr 30 '15 at 04:33

Lal

votes

1 answer

How to build a termdocumentmatrix in R

I was wondering if it's possible to build a TermDocumentMatrix without using the tm package. I was thinking about using two for loops in combination with grep, but unfortunately I did not manage to create anything useful. matrix <- matrix(,…

r matrix binary term-document-matrix

asked Mar 16 '15 at 17:34

Olivier Thierie

votes

1 answer

My DocumentTermMatrix reduces to Zero columns

train <- read.delim('train.tsv', header= T, fileEncoding= "windows-1252",stringsAsFactors=F) train.tsv contains 156,060 lines of text with 4 columns named Phrase, PhraseID, SentenceID and Sentiment (on a scale of 0 to 4). The Phrase column has the text…

r text-mining tm term-document-matrix

asked Jan 31 '15 at 05:35

Avneesh047

votes

1 answer

Given TermDocumentMatrix, How can I convert it to numeric matrix?

I have already generated a TermDocumentMatrix as shown below: > tmm [[1]] <> Non-/sparse entries: 11956/75992 Sparsity : 86% Maximal term length: 25 Weighting : term frequency…

r term-document-matrix

asked Jul 17 '14 at 16:33

user3746295

votes

1 answer

Unused Argument error in R using tm for word frequency matrix?

I'm new to programming and R. I'm trying to use the wordfish function in the Austin package. I created a term document matrix from a corpus but cannot successfully use the wordfish command: library(tm) library(austin) …

r package word-frequency term-document-matrix

asked Jun 13 '14 at 20:31

user3738982

votes

1 answer

Create Term Document Matrix of Bi Grams?

I am doing text mining on a large data set. I was able to create a TDM and a DTM and to perform my analysis using TDF & IDF. But can we create a term-document matrix or document-term matrix for bigrams in R? I know a similar facility is available…

r matrix nlp text-mining term-document-matrix

asked May 14 '14 at 06:34

Tanveer

votes

1 answer

Including all tokens in the term-document matrix in the R tm package

I'm trying to make a term-document matrix with the TermDocumentMatrix function of the tm package in R and found that some words are not included. > library(tm) > tdm <- TermDocumentMatrix(Corpus(VectorSource("The book is of great importance."))) >…

r tm term-document-matrix

asked Jan 31 '14 at 16:01

Akira Murakami

votes

2 answers

Generate Term-Document matrix using Lucene 4.4

I'm trying to create a term-document matrix for a small corpus to further experiment with LSI. However, I couldn't find a way to do it with Lucene 4.4. I know how to get the TermVector for each document as follows: //create boolean query to search for a…

java lucene term-document-matrix

asked Oct 08 '13 at 14:38

chepukha

votes

2 answers

Read a term-document matrix from csv using python

The reason the classic csv reader doesn't work on term-document arrays is that the first column of the CSV file contains terms, not values. Thus the file has the following syntax: "";"label1";"label2";"label3"…

python csv term-document-matrix large-data

asked May 08 '13 at 17:06

stelios
