Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. These matrices are widely used in natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
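
The example above can be reproduced with a minimal Python sketch. The function names `document_term_matrix` and `tf_idf` are hypothetical helpers written for this illustration, not part of any library; tokenization is a plain whitespace split, and the tf-idf variant shown (raw count times `log(N / df)`) is only one of several common definitions.

```python
import math
from collections import Counter


def document_term_matrix(docs, vocab):
    """Return one row per document; entry j counts occurrences of vocab[j]."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())  # naive whitespace tokenization
        rows.append([counts[term] for term in vocab])
    return rows


def tf_idf(rows):
    """Weight raw counts by log(N / df), one common tf-idf variant."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_terms)]
        for row in rows
    ]


docs = ["I like databases", "I hate hate databases"]
vocab = ["I", "like", "hate", "databases"]  # column order from the table above

dtm = document_term_matrix(docs, vocab)
weights = tf_idf(dtm)

for name, row in zip(["D1", "D2"], dtm):
    print(name, row)
```

With this weighting, terms that occur in every document ("I", "databases") receive weight zero, while "hate" in D2 receives the largest weight because it is both frequent in that document and absent from the other.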

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions

2 answers

tm package: Output of findAssocs() in a matrix instead of a list in R

Consider the following list: library(tm) data("crude") tdm <- TermDocumentMatrix(crude) a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1)) How do I manage to have a data frame with all terms associated with these 3 words in the…

r matrix tm term-document-matrix

asked Sep 24 '14 at 03:59

Steven Beaupré

2 answers

Frequency Per Term - R TM DocumentTermMatrix

I'm very new to R and cannot quite wrap my head around DocumentTermMatrixs. I have a DocumentTermMatrix created with the TM package, it has the term frequency and the terms inside it but I cannot figure out how to access them. Ideally, I would…

r tm term-document-matrix

asked Jan 20 '13 at 17:03

user1994952

0 answers

R text mining package DocumentTermMatrix with a dictionary in the control list takes way too much memory

I have noticed that DocumentTermMatrix(myCorpus, control=list(dictionary=myDict)) consumes way more memory than DocumentTermMatrix(myCorpus) Why is this happening? Any leads? Here is the code snippet: library(tm) library(XML) source("MyXMLReader.r")…

r memory-management text-mining tm term-document-matrix

asked Jul 10 '11 at 22:49

Shivani Rao

1 answer

Term document entropy calculation

Using dtm it is possible to take the term frequency. How is it possible or is there any easy way to calculate the entropy? It is giving higher weight to the terms with less frequency in some documents. entropy = 1 + (Σ_j p_ij log2(p_ij)) / log2(n), p_ij =…

r term-document-matrix quanteda

asked Feb 10 '18 at 19:11

Airi

1 answer

Maximal term length in Document Term Matrix

Imagine the following Document Term Matrix created by tm package: > frequencies <> Non-/sparse entries: 7693/112157 Sparsity : 94% Maximal term length: 10 Weighting : term frequency…

r nlp tm term-document-matrix

asked Jan 29 '18 at 12:44

ch.elahe

1 answer

Use DocumentTermMatrix in R with 'dictionary' parameter

I want to use R for text classification. I use DocumentTermMatrix to return the matrix of word: library(tm) crude <- "japan korea usa uk albania azerbaijan" corps <- Corpus(VectorSource(crude)) dtm <- DocumentTermMatrix(corps) inspect(dtm) words <-…

r tm corpus term-document-matrix

asked Jun 20 '17 at 04:41

Izzur Zuhri

1 answer

Create dfm step by step with quanteda

I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one…

r text-analysis term-document-matrix quanteda

asked Aug 13 '16 at 09:54

000andy8484

1 answer

R text mining how to segment document into phrases not terms

When do text mining using R, after reprocessing text data, we need create a document-term matrix for further exploring. But in similar with Chinese, English also have some certain phases, such as "semantic distance", "machine learning", if you…

r text-mining n-gram term-document-matrix quanteda

asked Apr 18 '16 at 09:24

Fiona_Wang

3 answers

R - slowly working lapply with sort on ordered factor

Based on the question More efficient means of creating a corpus and DTM I've prepared my own method for building a Term Document Matrix from a large corpus which (I hope) do not require Terms x Documents memory. sparseTDM <- function(vc){ id =…

r text-mining lapply corpus term-document-matrix

asked Apr 05 '15 at 23:37

Krzysztof Jędrzejewski

1 answer

Using lapply on term document matrix to calculate word frequency

Given three TermDocumentMatrix, text1, text2 and text3, I'd like to calculate word frequency for each of them into a data frame and rbind all the data frames. Three are sample - I have hundreds in reality so I need to functionalize this. It's easy…

r lapply term-document-matrix

asked Mar 18 '15 at 19:40

vagabond

2 answers

R and tm package: create a term-document matrix with a dictionary of one or two words?

Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords. Web Search: Being new to text-mining and the tm package in R, I went to the web to figure out how to do this. …

r tm n-gram term-document-matrix rweka

asked Jan 19 '15 at 20:33

b_ron_

2 answers

Term document matrix and cosine similarity in Python

I have following situation that I want to address using Python (preferably using numpy and scipy): Collection of documents that I want to convert to a sparse term document matrix. Extract sparse vector representation of each document (i.e. a row in…

python numpy scipy term-document-matrix

asked Aug 07 '13 at 20:40

abhinavkulkarni

1 answer

Is there any situation the TF-IDF is worse that using term-frequency vectors?

I am doing text classification now. Is there any situation the TF-IDF is worse that using term-frequency vectors? How to explain it? Thanks

nlp mahout tf-idf term-document-matrix

asked Apr 03 '13 at 16:14

Meng Zhang

1 answer

Error: cannot allocate vector of size 38.3 Gb while creating a document term matrix

It's the first time that I post on Stackoverflow, I'm a student. I hope someone will be able to help me. I am trying to do sentiment analysis in R Studio and am facing vector size error: When I try to create a Document Term Matrix using this…

r vector size term-document-matrix

asked Mar 10 '21 at 22:29

Magnon

1 answer

Creating a term frequency matrix from a Python Dataframe

I am doing some natural language processing on some twitter data. So I managed to successfully load and clean up some tweets and placed it into a data frame below. id text …

python scikit-learn nltk sklearn-pandas term-document-matrix

asked Mar 12 '19 at 03:43

greatFritz
