Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms (a term-document matrix is its transpose, with terms as rows and documents as columns). There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
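As a rough illustration of the example above, the following Python sketch builds the same count matrix for D1 and D2 and then applies one simple tf-idf variant (raw count times log(N/df)). The whitespace tokenization, the fixed column order, and the variable names `dtm` and `tfidf` are assumptions made for this sketch; real libraries typically lowercase, strip punctuation, and use smoothed idf formulas, so their numbers will differ.

```python
import math
from collections import Counter

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Column order fixed by hand to match the table above (an assumption for
# this sketch; normally the vocabulary is derived from the corpus).
vocab = ["I", "like", "hate", "databases"]

# Naive whitespace tokenization, then per-document term counts.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Document-term matrix: one row per document, one column per term.
dtm = {name: [counts[name][term] for term in vocab] for name in docs}

# Document frequency: in how many documents each term occurs at least once.
df = [sum(1 for row in dtm.values() if row[j] > 0) for j in range(len(vocab))]

# One simple tf-idf variant: raw count * log(N / df).
n_docs = len(docs)
idf = [math.log(n_docs / d) for d in df]
tfidf = {name: [tf * w for tf, w in zip(row, idf)] for name, row in dtm.items()}

for name in docs:
    print(name, dtm[name], [round(x, 3) for x in tfidf[name]])
```

With this weighting, terms that occur in every document ("I", "databases") get weight 0, while "hate" in D2 gets the largest weight because it is both frequent in D2 and absent from D1.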

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
2
votes
1 answer

How to get term-document matrix from multiple documents with Spark?

I'm trying to generate a term-document matrix from multiple documents. I could run an LDA model from an already created matrix; now I need the step before it. I've tried to implement a simple term-doc matrix, but now I'm stuck. What I did was: //GETS ALL…
2
votes
2 answers

R - comparing two corpuses to create a NEW corpus with words with higher frequency from corpus #1

I have two corpuses that contain similar words, similar enough that using setdiff doesn't really help my cause. So I've turned towards finding a way to extract a list or corpus (to eventually make a wordcloud) of words that are more frequent…
SpicyClubSauce
2
votes
2 answers

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only terms I specify up front are to be used and…
Ricky
2
votes
1 answer

R: clustering documents

I've got a documentTermMatrix that looks as follows:

       artikel  naam  product  personeel  loon  verlof
doc 1  1        1     2        1          0     0
doc 2  1        1     1        0          0     0
doc 3  0        0     1        1          …
Anita
2
votes
1 answer

findAssocs for multiple terms in R

In R I used the tm package to build a term-document matrix from a corpus of documents. My goal is to extract word associations for all bigrams in the term-document matrix and return the top three or so for each. Therefore I'm looking…
Grote
1
vote
0 answers

Function Corpus in Quanteda doesn't work because of a kwic objects

First of all, I'm working on a big data project which consists of analyzing some press URLs to detect the most popular topics. My topic is football (the Mbappe contract), and I collected 180 URLs from Marca, a Spanish media outlet, in a .txt file. When…
Cristina
1
vote
0 answers

TermDocumentMatrix function stops executing in R / RStudio which is a prerequisite for Wordcloud function

I have been trying to execute the TermDocumentMatrix function on my corpus of texts, but R and RStudio gave me an error: Error in tdm(txt, isTRUE(control$removePunctuation), isTRUE(control$removeNumbers), : function 'Rcpp_precious_remove' not…
Vida
1
vote
0 answers

How to represent a document from test set with Document-Term Matrix created from training data? (Latent Semantic Indexing)

I build a model of document classification from the training set of documents. Classification is done by the vector representation of each document, that is, a row in the Document-Term Matrix. Then to test the model, I need the representation of…
1
vote
0 answers

TermDocumentMatrix not responding to Tokenizer

I am very new to R and I am trying to do an NGram WordCloud. However, my results always show a 1Gram instead of an NGram. I have searched for days for answers on the web and tried different methods...still the same result. Also, for some reason, I…
RdR
1
vote
1 answer

Document term matrix function returning 0 when applying the document term matrix

I have a corpus of 600 text files from which I want to extract every numerical combination after the term mim and create the document-term matrix to find frequencies per file. I used this code; it extracted all the wanted terms but it returns 0…
rachid rachid
1
vote
1 answer

Non English term document matrix

I have the following dataframe made of English and Hindi texts. I want to read the Hindi texts in R: Click Percentage Email_Subject 18.12807882 तेजस्वी गैलेक्सी ए 7 (2016) बस 1856 रुपए प्रति माह से शुरू खरीदें 11.91957875 तेजस्वी…
Raghavan vmvs
1
vote
3 answers

format the number of digits in results R

I created a document-term matrix that searches for numbers from 100000 to 600000 for some data mining issues, but I noticed that it doesn't return the wanted numbers; it combines every number with spaces or decimals into a 6-digit combination…
stephan
1
vote
1 answer

Find frequency of a custom word in R TermDocumentMatrix using TM package

I turned about 50,000 rows of varchar data into a corpus, and then proceeded to clean said corpus using the tm package, getting rid of stopwords, punctuation, and numbers. I then turned it into a TermDocumentMatrix and used the functions…
George
1
vote
1 answer

Add new document to term document matrix in R

I already have a term-document matrix and want to add a new document to it; put another way, I want to join two term-document matrices. My term-document matrix is: Docs Term 1 eat 7 food 2 run 2 sick 3 Then another…
1
vote
1 answer

I am trying to create a DocumentTermMatrix while keeping all special characters

Trying to do some text mining with R without removing any special characters. For example in the following "LKC" and "LKC_" should be different words. Instead it is dropping the _ and making it the same word. How can I accomplish…
jz_