Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, given the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I       like    hate    databases
D1      1       1       0       1
D2      1       0       2       1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
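
The example above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library: it assumes whitespace tokenization, fixes the column order by hand to match the table, and uses one common tf-idf variant (raw count times log(N / document frequency)); other weighting schemes differ in the details.

```python
import math
from collections import Counter

# The two example documents from the text.
docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Column order chosen by hand to match the table above (an assumption;
# a real pipeline would derive the vocabulary from the corpus).
terms = ["I", "like", "hate", "databases"]

# Whitespace tokenization (an assumption; no case folding or stemming).
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Document-term matrix: one row per document, one column per term.
dtm = {name: [counts[name][t] for t in terms] for name in docs}

# One common tf-idf variant: raw term count * log(N / document frequency).
n_docs = len(docs)
doc_freq = {t: sum(1 for name in docs if counts[name][t] > 0) for t in terms}
idf = {t: math.log(n_docs / doc_freq[t]) for t in terms}
tfidf = {
    name: [dtm[name][i] * idf[t] for i, t in enumerate(terms)]
    for name in docs
}

for name in docs:
    print(name, dtm[name])
# D1 [1, 1, 0, 1]
# D2 [1, 0, 2, 1]
```

Under this weighting, terms that occur in every document ("I" and "databases" here) get weight 0, while "hate" in D2 gets the largest weight because it is both frequent in that document and absent from the other.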

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
1
vote
1 answer

Generic way to avoid special characters in R

The following are a series of e mail subjects. DF- data.frame. Note I have imported this from an excel sheet. EmailSubject Buy the stunning new phone The game changer is here. Experience a phone ahead of its time. Thank You Chennai …
Vishnu Raghavan
  • 83
  • 1
  • 10
1
vote
2 answers

TermDocumentMatrix not working on corpus

Trying to load many email files and let R learn what's spam or ham. First, I created a corpus, I want to create a term document, I received an error. How to fix it? email_corpus <-…
Lin Ye
  • 11
  • 1
1
vote
0 answers

R tm - how to get sparsity of TermDocumentMatrix as a variable?

I have several large TermDocumentMatrices, which I'm trimming down to a more manageable size using the removeSparseTerms() function. One of the arguments I have to send this, of course, is sparse. Because the TDMs are all quite different, I'd like…
CrowsNose
  • 83
  • 1
  • 10
1
vote
1 answer

R, is there any way that make a termdocumentmatrix by using multiple core?

Hello. Is there any way to build a TermDocumentMatrix using multiple cores (parallel processing)? Or, to get results faster, can I use packages such as parallel, h2o, or others? Any help is appreciated. Thanks.
1
vote
1 answer

Create Corpus using Python

I am new to Python, I have created one term document matrix using R, I wanted to learn how I can use Python to create same. I am reading text data from Description column available in data frame Res_Desc_Train. But not sure how can I use…
user3734568
  • 1,311
  • 2
  • 22
  • 36
1
vote
1 answer

how can I use TermDocumentMatrix for persian text in R?

I want to view term frequencies in documents, my documents contain Persian text. I used R as follows: keycorpus <- Corpus(DirSource("E:\\Sample\\farsi texts")) tm.matrix <- TermDocumentMatrix(keycorpus) View(as.matrix(tm.matrix)) Although this code…
M.Rabiei
  • 11
  • 2
1
vote
3 answers

adding a new document to the term document matrix for similarity calculations

So I am aware there are several methods for finding a most similar or say three most similar documents in a corpus of documents. I know there can be scaling issues, for now I have around ten thousand documents and have been running tests on a subset…
cardamom
  • 6,873
  • 11
  • 48
  • 102
1
vote
1 answer

Why am I missing the last letter in term document matrix?

I am new to R and I'm trying to create term document matrix with a csv file. But the results show that some of the words are missing the letter "e" in the end. How can I make the term document matrix showing the full words? It will be great if you…
Amelia
  • 11
  • 2
1
vote
1 answer

Document Term Matrix will not maintain decimal places of numbers

Before I updated my version of RStudio, everything worked great. With the update something has changed with Document Term Matrix in the 'tm' package. I want to create a dtm, but with numbers. For instance if I have a .csv with one column as shown…
Will Ebert
  • 21
  • 6
1
vote
1 answer

Not able to see single digit/letter as a term in after creating TermDocument Matrix

I used TermDocument Matrix in R, and documents(strings) include single letter words also. After using TermDocument Matrix, the terms do not include those single letter words, please suggest which control should I include as an input argument in…
vaibhav
  • 87
  • 2
  • 6
1
vote
2 answers

How to filter term document matrix based on frequency of occurrence of each term

I have a term document matrix. I wish to subset it and keep only those terms which have appeared more than a certain number of times, i.e the row sum should be greater than a specific number. Any quick way to achieve this? B.T.W, the matrix is huge.
NinjaR
  • 621
  • 6
  • 22
1
vote
2 answers

how to read and write TermDocumentMatrix in r?

I made wordcloud using a csv file in R. I used TermDocumentMatrix method in the tm package. Here is my code: csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE) Encoding(csvData$content) <- "UTF-8" # useSejongDic() - KoNLP…
S.Kang
  • 581
  • 2
  • 10
  • 28
1
vote
0 answers

Why can't I create a document term matrix?

I'm using R 3.3.0 and for some reason, I cannot create a DTM without receiving the error: Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "try-error" In addition: Warning messages: 1: In…
Amarins
  • 43
  • 1
  • 1
  • 5
1
vote
1 answer

Text mining with R: use of sub

I am on a project with R and I am starting to get my hands dirty with it. In the first part I try to clean the data of vector msg. But later when I build the termdocumentmatrix, these characters still appear. I would like to remove words with less…
Claudio
  • 63
  • 1
  • 1
  • 7
1
vote
1 answer

Empty term document matrix

I seem to run into a problem whenever I try to inspect my freq. words and associations. When I make the tdm I get this info: TermDocumentMatrix I can see I have plenty of terms to use, in plenty of documents. However! When I try to inspect the…