Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. These matrices are useful in natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
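The example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the whitespace tokenizer, the fixed column order, and the names `documents`, `vocabulary`, `dtm`, and `tfidf` are illustrative assumptions, and the tf-idf variant shown (raw count times log(N / document frequency)) is just one of several common definitions.

```python
import math
from collections import Counter

# The two short documents from the example above.
documents = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Assumption: column order is fixed by hand to match the table above.
vocabulary = ["I", "like", "hate", "databases"]

# Assumption: naive whitespace tokenization, case preserved.
counts = {name: Counter(text.split()) for name, text in documents.items()}

# Document-term matrix: one row per document, one column per term.
dtm = {name: [counts[name][term] for term in vocabulary] for name in documents}

# One possible tf-idf weighting: tf * log(N / df), where df is the number
# of documents containing the term. Other variants add smoothing.
n_docs = len(documents)
df = {term: sum(1 for c in counts.values() if c[term] > 0) for term in vocabulary}
idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
tfidf = {
    name: [counts[name][term] * idf[term] for term in vocabulary]
    for name in documents
}

for name in documents:
    print(name, dtm[name], [round(w, 3) for w in tfidf[name]])
```

With this weighting, terms that occur in every document (here "I" and "databases") receive weight zero, while "like" and "hate" keep a positive weight. Libraries such as scikit-learn or the R package tm apply their own tokenization and weighting defaults, so their output may differ from this sketch.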

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions

votes

1 answer

Build a document-term matrix from a list of documents, each of which is in list form

I wonder if there exists an elegant way to convert a list of documents to a document-term matrix. The motivation is the need for subtle transformations of the terms in the documents, e.g., stemming. The input data is…

python term-document-matrix

asked May 29 '17 at 08:56

Li haonan

votes

2 answers

Keep all the text phrases for data frequency

I have a data frame with only one column "text" "text" "User Interfaces" "Twitter" "Text Normalization" "Term weighting" "Teenagers" "Team member replacement" I would like to take a dataframe with the frequency of every phrase, like this: "User…

r term-document-matrix

asked May 12 '17 at 20:34

Keri

votes

1 answer

how to remove NA columns from TDM for clustering

I'm struggling with TDM NA values to commit the clustering. Initially I've set: titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf))))) titles.sc <- scale(na.omit(titles.tdm)) and got matrix of 418…

r cluster-analysis term-document-matrix

asked Apr 30 '17 at 22:39

Peter.k

votes

1 answer

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

I try to create a matrix, for this I would like to tolower text. For this I use this R instruction : matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, …

r matrix text-processing tm term-document-matrix

asked Mar 22 '17 at 13:04

Datackatlon

votes

1 answer

How to increase font size of TermDocumentMatrix plot on Rstudio?

I'm working on some tweets and using text mining techniques. I used the following command, and the plot is unreadable because the font size is so small. How can I fix it? plot(tdm, term = freq.terms, corThreshold = 0.95, ps=30)

plot rstudio font-size term-document-matrix

asked Dec 25 '16 at 19:51

Kiran Masood

votes

1 answer

Does tm automatically ignore the very short strings?

Here is my code: example 1: a <- c("ab cd de","ENERGIZER A23 12V ALKALINE BATTERi") a1 <- VCorpus(VectorSource(a)) a2 <- TermDocumentMatrix(a1,control = list(stemming=T)) inspect(a2) The result is: Docs Terms 1 2 12v 0 1 a23 …

r tm term-document-matrix

asked Nov 09 '16 at 02:39

Feng Chen

votes

0 answers

Converting frequency table directly to TDM in R

I have the following data, which consists of "Frequency" information that I extracted using a Python Script. I want to use this information to generate a WordCloud2 in R. WORD VALUE SENT 1 topnotch 1 1 2 …

r matrix dataframe term-document-matrix

asked Oct 17 '16 at 16:28

owwoow14

votes

0 answers

R: error creating a termDocumentMatrix() object

Here's my code that I used to create the termdocumentmatrix object for training data: text_train = iconv(data_train$SentimentText, "UTF-8", "ASCII", sub = "") corpus_train = Corpus(VectorSource(text_train)) tdm_train = TermDocumentMatrix( …

r term-document-matrix

asked May 10 '16 at 06:47

alwaysaskingquestions

votes

1 answer

Error: inherits(doc, "TextDocument") is not TRUE

I am running the following code chunk tdm = TermDocumentMatrix(ctext,control=list(minWordLength=1)) print(tdm) inspect(tdm[10:20,11:18]) out = findFreqTerms(tdm,lowfreq=5) print(out) When I run it in console it runs fine. However when I include…

r term-document-matrix

asked Mar 17 '16 at 06:33

FlyingPickle

votes

2 answers

how to selected vocabulary in scikit CountVectorizer

I have used scikit CountVectorizer to convert collection of documents into matrix of token counts. I have also used its max_features which considers the top max_features ordered by term frequency across the corpus. Now I want to analyse my selected…

python scikit-learn term-document-matrix

asked Mar 16 '16 at 20:21

Shweta

votes

4 answers

keep words present in a given vector and remove others

I have a list of say, 10,000 strings (A). I also have a vector of words (V). What I want to do is to modify each string of A to keep only those words in the string which are present in V and remove others. For example, let's say first element of A…

python for-loop corpus term-document-matrix

asked Feb 17 '16 at 12:10

user3664020

votes

0 answers

Counting the frequency of Numbers by Term Document Matrix in R

I started using tm package and have got a question in the Term Document Matrix function. I know that with this function, we can get the frequency of words across a set of documents. But I have noticed that it doesn't show the frequency for…

r tm term-document-matrix

asked Jan 20 '16 at 07:09

sachinv

votes

1 answer

apply a function to multiple Document Term Matrices

I have 5 Document Term Matrices, say, DTM1, DTM2, DTM3, DTM4, DTM5. Now I have written a function called myBarPlot(DTM, title, color) which accepts a DocumentTermMatrix and a Title (character) to each Plot and separate color for each plot. Now how…

r function for-loop tm term-document-matrix

asked Jan 20 '16 at 06:20

Sourav Ghosh

votes

0 answers

TermDocumentMatrix in R does not function

I have a corpus that looks like this: my corpus, myCorpus1, contains 33704 tweets. You can see it below the code. But when I create the term matrix, which is a TermDocumentMatrix, there are only 3732 documents. My question is how the…

r twitter tm corpus term-document-matrix

asked Dec 15 '15 at 04:19

Denis

votes

1 answer

twitter data <- error in termdocumentmatrix

# search for a term in twitter rdmTweets <- searchTwitteR("machine learning", n=500, lang="en") dtm.control <- list( tolower = TRUE, removePunctuation = TRUE, removeNumbers = TRUE, removestopWords = TRUE, stemming = TRUE,…

r twitter term-document-matrix

asked Nov 15 '15 at 16:49

user2241260
