Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
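
As a rough illustration of the example above, the following Python sketch builds the count matrix for D1 and D2 and then applies one simple tf-idf weighting. The function names `document_term_matrix` and `tf_idf` are hypothetical, tokenization is a plain whitespace split, and the weighting used here (raw count times log(N / df)) is only one of several tf-idf variants, chosen as an assumption for this sketch.

```python
import math
from collections import Counter


def document_term_matrix(docs, vocabulary=None):
    """Return (terms, rows) where rows[i][j] counts terms[j] in docs[i]."""
    # Naive whitespace tokenization (an assumption); real pipelines usually
    # also lowercase, strip punctuation, and remove stop words.
    tokenized = [doc.split() for doc in docs]
    if vocabulary is None:
        # Default column order: first appearance across the collection.
        vocabulary = list(dict.fromkeys(t for toks in tokenized for t in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def tf_idf(rows):
    """Weight raw counts by log(N / df), one of several tf-idf variants."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in rows]


docs = ["I like databases", "I hate hate databases"]
# Pass the column order explicitly so it matches the table above.
terms, counts = document_term_matrix(docs, ["I", "like", "hate", "databases"])
weighted = tf_idf(counts)

print(terms)
for name, row in zip(["D1", "D2"], counts):
    print(name, row)
```

With this weighting, terms that occur in every document ("I" and "databases") get weight 0, while "hate" in D2 gets the largest weight because it is both frequent in that document and absent from the other.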

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
4
votes
2 answers

tm package: Output of findAssocs() in a matrix instead of a list in R

Consider the following list: library(tm) data("crude") tdm <- TermDocumentMatrix(crude) a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1)) How do I manage to have a data frame with all terms associated with these 3 words in the…
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
4
votes
2 answers

Frequency Per Term - R TM DocumentTermMatrix

I'm very new to R and cannot quite wrap my head around DocumentTermMatrixs. I have a DocumentTermMatrix created with the TM package, it has the term frequency and the terms inside it but I cannot figure out how to access them. Ideally, I would…
user1994952
  • 41
  • 1
  • 3
3
votes
0 answers

R text mining package DocumentTermMatrix with a dictionary in the control list takes way too much memory

I have noticed that DocumentTermMatrix(myCorpus, control=list(dictionary=myDict)) consumes way more memory than DocumentTermMatrix(myCorpus) Why is this happening? Any leads? Here is the code snippet: library(tm) library(XML) source("MyXMLReader.r")…
3
votes
1 answer

Term document entropy calculation

Using dtm it is possible to take the term frequency. How is it possible or is there any easy way to calculate the entropy? It is giving higher weight to the terms with less frequency in some documents. entropy = 1 + (Σj pij log2(pij)) / log2(n), pij =…
Airi
  • 43
  • 5
3
votes
1 answer

Maximal term length in Document Term Matrix

Imagine the following Document Term Matrix created by tm package: > frequencies <> Non-/sparse entries: 7693/112157 Sparsity : 94% Maximal term length: 10 Weighting : term frequency…
ch.elahe
  • 289
  • 4
  • 18
3
votes
1 answer

Use DocumentTermMatrix in R with 'dictionary' parameter

I want to use R for text classification. I use DocumentTermMatrix to return the matrix of word: library(tm) crude <- "japan korea usa uk albania azerbaijan" corps <- Corpus(VectorSource(crude)) dtm <- DocumentTermMatrix(corps) inspect(dtm) words <-…
Izzur Zuhri
  • 33
  • 1
  • 6
3
votes
1 answer

Create dfm step by step with quanteda

I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one…
000andy8484
  • 563
  • 3
  • 16
3
votes
1 answer

R text mining how to segment document into phrases not terms

When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But, similar to Chinese, English also has certain phrases, such as "semantic distance", "machine learning", if you…
Fiona_Wang
  • 163
  • 1
  • 2
  • 12
3
votes
3 answers

R - slowly working lapply with sort on ordered factor

Based on the question "More efficient means of creating a corpus and DTM", I've prepared my own method for building a term-document matrix from a large corpus which (I hope) does not require Terms x Documents memory. sparseTDM <- function(vc){ id =…
3
votes
1 answer

Using lapply on term document matrix to calculate word frequency

Given three TermDocumentMatrix, text1, text2 and text3, I'd like to calculate word frequency for each of them into a data frame and rbind all the data frames. Three are sample - I have hundreds in reality so I need to functionalize this. It's easy…
vagabond
  • 3,526
  • 5
  • 43
  • 76
3
votes
2 answers

R and tm package: create a term-document matrix with a dictionary of one or two words?

Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords. Web Search: Being new to text-mining and the tm package in R, I went to the web to figure out how to do this. …
b_ron_
  • 197
  • 1
  • 1
  • 10
3
votes
2 answers

Term document matrix and cosine similarity in Python

I have following situation that I want to address using Python (preferably using numpy and scipy): Collection of documents that I want to convert to a sparse term document matrix. Extract sparse vector representation of each document (i.e. a row in…
abhinavkulkarni
  • 2,284
  • 4
  • 36
  • 54
3
votes
1 answer

Is there any situation where TF-IDF is worse than using term-frequency vectors?

I am doing text classification now. Is there any situation where TF-IDF is worse than using term-frequency vectors? How can it be explained? Thanks
Meng Zhang
  • 337
  • 1
  • 4
  • 13
2
votes
1 answer

Error: cannot allocate vector of size 38.3 Gb while creating a document term matrix

It's the first time that I post on Stackoverflow, I'm a student. I hope someone will be able to help me. I am trying to do sentiment analysis in R Studio and am facing vector size error: When I try to create a Document Term Matrix using this…
Magnon
  • 21
  • 1
2
votes
1 answer

Creating a term frequency matrix from a Python Dataframe

I am doing some natural language processing on some twitter data. So I managed to successfully load and clean up some tweets and placed it into a data frame below. id text …