Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I    like    hate    databases
D1      1    1       0       1
D2      1    0       2       1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
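
The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `document_term_matrix` and `tf_idf` are hypothetical (they do not come from any library), tokenization is a naive whitespace split, and the tf-idf variant used here (raw count times log(N/df)) is just one of several common definitions.

```python
import math
from collections import Counter


def document_term_matrix(docs, vocabulary=None):
    """Return (terms, rows), where rows[i][j] counts terms[j] in docs[i]."""
    # Assumption: naive whitespace tokenization. Real pipelines usually also
    # lowercase, strip punctuation, remove stop words, stem, and so on.
    tokenized = [doc.split() for doc in docs]
    if vocabulary is None:
        # Default column order: first appearance across the collection.
        vocabulary = list(dict.fromkeys(t for toks in tokenized for t in toks))
    counts = [Counter(toks) for toks in tokenized]
    # Counter returns 0 for terms that a document does not contain.
    rows = [[c[term] for term in vocabulary] for c in counts]
    return list(vocabulary), rows


def tf_idf(rows):
    """Reweight raw counts as count * log(N / df), one common tf-idf variant."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # df[j] = number of documents that contain term j at least once.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_terms)]
        for row in rows
    ]


docs = ["I like databases", "I hate hate databases"]
# Fix the column order to match the table above.
terms, rows = document_term_matrix(
    docs, vocabulary=["I", "like", "hate", "databases"]
)
weights = tf_idf(rows)

print(terms)
for name, row in zip(["D1", "D2"], rows):
    print(name, row)
```

With this weighting, terms that occur in every document ("I" and "databases" here) get weight 0, while "like" and "hate" keep a positive weight only in the document that contains them.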

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
0 votes, 1 answer

Build a document-term matrix from a list of documents, each of which is in list form

I wonder if there exists an elegant way to convert a list of documents to a document-term matrix. The motivation is the need for subtle transformations of the terms in the documents, e.g., stemming. The input data is…
Li haonan

0 votes, 2 answers

Keep all the text phrases for data frequency

I have a data frame with only one column "text" "text" "User Interfaces" "Twitter" "Text Normalization" "Term weighting" "Teenagers" "Team member replacement" I would like to take a dataframe with the frequency of every phrase, like this: "User…
Keri

0 votes, 1 answer

how to remove NA columns from TDM for clustering

I'm struggling with NA values in a TDM when trying to run the clustering. Initially I've set: titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf))))) titles.sc <- scale(na.omit(titles.tdm)) and got a matrix of 418…
Peter.k

0 votes, 1 answer

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

I am trying to create a matrix, and I would like to lowercase the text first. For this I use this R instruction: matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, …
Datackatlon

0 votes, 1 answer

How to increase font size of TermDocumentMatrix plot on Rstudio?

I'm working on some tweets and using text mining techniques. I used the following command, and the plot is unreadable because the font size is so small. How can I fix it? plot(tdm, term = freq.terms, corThreshold = 0.95, ps=30)

0 votes, 1 answer

Does tm automatically ignore the very short strings?

Here is my code: example 1: a <- c("ab cd de","ENERGIZER A23 12V ALKALINE BATTERi") a1 <- VCorpus(VectorSource(a)) a2 <- TermDocumentMatrix(a1,control = list(stemming=T)) inspect(a2) The result is: Docs Terms 1 2 12v 0 1 a23 …
Feng Chen

0 votes, 0 answers

Converting frequency table directly to TDM in R

I have the following data, which consists of "Frequency" information that I extracted using a Python Script. I want to use this information to generate a WordCloud2 in R. WORD VALUE SENT 1 topnotch 1 1 2 …
owwoow14

0 votes, 0 answers

R: error creating a termDocumentMatrix() object

Here's my code that I used to create the termdocumentmatrix object for training data: text_train = iconv(data_train$SentimentText, "UTF-8", "ASCII", sub = "") corpus_train = Corpus(VectorSource(text_train)) tdm_train = TermDocumentMatrix( …
alwaysaskingquestions

0 votes, 1 answer

Error: inherits(doc, "TextDocument") is not TRUE

I am running the following code chunk: tdm = TermDocumentMatrix(ctext,control=list(minWordLength=1)) print(tdm) inspect(tdm[10:20,11:18]) out = findFreqTerms(tdm,lowfreq=5) print(out) When I run it in the console, it runs fine. However, when I include…
FlyingPickle

0 votes, 2 answers

how to selected vocabulary in scikit CountVectorizer

I have used scikit-learn's CountVectorizer to convert a collection of documents into a matrix of token counts. I have also used its max_features parameter, which keeps only the top max_features terms ordered by term frequency across the corpus. Now I want to analyse my selected…
Shweta

0 votes, 4 answers

keep words present in a given vector and remove others

I have a list of, say, 10,000 strings (A). I also have a vector of words (V). What I want to do is modify each string of A to keep only those words in the string that are present in V and remove the others. For example, let's say first element of A…
user3664020

0 votes, 0 answers

Counting the frequency of Numbers by Term Document Matrix in R

I started using the tm package and have a question about the Term Document Matrix function. I know that with this function, we can get the frequency of words across a set of documents. But I have noticed that it doesn't show the frequency for…
sachinv

0 votes, 1 answer

apply a function to multiple Document Term Matrices

I have 5 Document Term Matrices, say, DTM1, DTM2, DTM3, DTM4, DTM5. I have written a function called myBarPlot(DTM, title, color), which accepts a DocumentTermMatrix, a title (character) for each plot, and a separate color for each plot. Now how…

0 votes, 0 answers

TermDocumentMatrix in R does not function

I have a corpus that looks like this: my corpus, myCorpus1, contains 33704 tweets. You can see it in the code below. But when I create the term matrix, which is a TermDocumentMatrix, there are only 3732 documents. My question is how the…
Denis

0 votes, 1 answer

twitter data <- error in termdocumentmatrix

# search for a term in twitter rdmTweets <- searchTwitteR("machine learning", n=500, lang="en") dtm.control <- list( tolower = TRUE, removePunctuation = TRUE, removeNumbers = TRUE, removestopWords = TRUE, stemming = TRUE,…