Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

      I    like   hate   databases
D1    1    1      0      1
D2    1    0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
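The example above can be reproduced with a minimal Python sketch using only the standard library. The function names `build_dtm` and `tfidf` are hypothetical, tokenization is a plain whitespace split, and the tf-idf variant shown (raw count times log(N / document frequency)) is just one of several common weightings, chosen here as an assumption.

```python
import math
from collections import Counter


def build_dtm(docs, vocabulary):
    """Return a count matrix: one row per document, one column per term.

    Hypothetical helper: tokenization is a simple whitespace split,
    and matching is case-sensitive.
    """
    matrix = []
    for text in docs:
        counts = Counter(text.split())
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return matrix


def tfidf(matrix):
    """Weight raw counts by log(N / df); an assumed, simple tf-idf variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_terms)]
        for row in matrix
    ]


docs = ["I like databases", "I hate hate databases"]
vocabulary = ["I", "like", "hate", "databases"]  # column order of the table

dtm = build_dtm(docs, vocabulary)
weighted = tfidf(dtm)

for name, row in zip(["D1", "D2"], dtm):
    print(name, row)
```

With this weighting, terms that occur in every document (here "I" and "databases") get weight 0, while "hate" in D2 gets the largest weight.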

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions

votes

1 answer

R: import pdf and create TermDocumentMatrix with files names as id

I'm importing pdf to R in order to do some text analysis. I have a number of pdf files whose names are their publication year (one publication per year). I would like to create a TermDocumentMatrix after importing them for which the first term…

r pdf import tm term-document-matrix

asked Feb 24 '19 at 21:43

Colombe

votes

1 answer

R convert dataframe to term-document-matrix

I'm currently learning my way around R and I'm troubled by the following problem: I've got a dataframe that is built up like this: word freq1 freq2 tree 10 20 this 2 3 that 4 …

r dataframe term-document-matrix

asked Jan 23 '19 at 09:50

Hunterofdark91

votes

1 answer

structure of DTM

> str(dtm)
List of 6
 $ i    : int [1:128403] 1 1 1 1 1 1 1 1 1 1 ...
 $ j    : int [1:128403] 1 2 3 4 5 6 7 8 9 10 ...
 $ v    : num [1:128403] 1 1 1 1 2 1 1 1 1 1 ...
 $ nrow : int 500
 $ ncol : int 8330
Could anyone please tell…

r tm term-document-matrix

asked Oct 12 '18 at 13:40

Qweuuu Qqqwer

votes

0 answers

Remove the most and the least appearing terms from a Term Document Matrix in R

I'm reading a Korean text file and trying to remove the most frequent terms (stopwords) and the least frequent terms from a Term Document Matrix generated in R. From the code below I'm able to get the TDM, but it has weights for all the…

r nlp term-document-matrix

asked Jul 31 '18 at 12:49

Kailash Sharma

votes

1 answer

build term document matrix from PDF file

I am trying to build a term document matrix from one PDF text. When I inspect the term document matrix, I get this: <> The number of documents should be 1, not 342; 342 is the number of pages in the PDF file.…

r pdf information-retrieval tm term-document-matrix

asked Apr 16 '18 at 13:14

Hilfit19

votes

0 answers

How to remove empty documents from a Term-Document-Matrix in R

So I created a term document matrix from a corpus in R: tdm_tfidf <-TermDocumentMatrix(corpus,control=list(weighting=weightTfIdf)) However, there is a warning that the TDM contains empty documents: Warning: In weighting(x) : empty document(s): 54…

r text-mining corpus term-document-matrix

asked Mar 31 '18 at 13:18

Lucinho91

votes

0 answers

Adding RegEx to specify character ngrams for a corpus in R

I'm having trouble using a RegEx on a corpus. I read in a couple of text documents that I converted to a corpus. I want to display it in a TermDocumentMatrix after some pre-processing. First I want to specify them with the RegEx "(\b([a-z]*)\B)".…

r regex text-mining corpus term-document-matrix

asked Mar 18 '18 at 13:13

J.B.

votes

1 answer

Word Term Matrix

I would love to create a Word matrix from some tweets, each word from the tweet has to be a new variable and be filled with 1 for only the words that correspond to that text in the tweet x <- data.frame("Tweet" = c("hi all","I need help"), "N" = 1,…

r twitter tm tidy term-document-matrix

asked Jan 19 '18 at 16:05

Suanbit

votes

0 answers

Creating a Matrix with DocumentTermMatrix from the strings in a column in R (text-mining)

The dataset looks as follows: sentiment<-c(1, 1, 0) review<-c("review1", "review2", "review3") #insert some reviews here daamx<-data.frame(sentiment, review) The sentiment column is a value that says if the review is positive or negative (1 or 0),…

r text-mining tm term-document-matrix

asked Jan 02 '18 at 11:56

VBApros

votes

2 answers

Why my Term Document Matrix has letters missing at end?

I'm working on creating a word cloud. After creating it, I see many words with their last letters missing, e.g. Movie --> movi, become --> becom. I've marked the words in yellow; the last one or two letters are missing.

r stemming term-document-matrix

asked Nov 26 '17 at 04:54

Aniket Ghorpade

votes

1 answer

Find top features by Id(Contains Multiple documents with same Id) from DTM

I am using package tm. I have a dataframe with 2 columns: the first column is ID and the second column contains text. The dataframe looks as follows. Id Text 13456 Hi, Good morning 13457 How are you? 13456 May I know who I am speaking…

r text-mining tm term-document-matrix

asked Nov 20 '17 at 09:38

Bhavya

votes

1 answer

Extract top features by frequency per document from a dtm in R

I have a dtm and want to extract the top 5 terms by frequency for each document from the document term matrix. I have a dtm built using the tm package Terms Docs aaaa aac abrt abused accept accepted 1 0 0 0 0 0 0 2 0 0 0 0 0…

r text-mining tm term-document-matrix

asked Nov 16 '17 at 11:42

Bhavya

votes

1 answer

How to tokenize text using punctuation as boundaries (Python)

I'm using CountVectorizer from sklearn to do text tokenization (2-gram) and create a term-document matrix. How can I tokenize text into 2-grams with punctuation as boundaries? For example, the input sentence is "this is example, with punctuation." I…

python tokenize term-document-matrix

asked Sep 15 '17 at 09:09

Yichi Liu

votes

1 answer

TM DocumentTermMatrix gives results which are unexpected given the corpus

Maybe I misinterpret how tm::DocumentTermMatrix works. I have a corpus which after preprocessing looks like this: head(Description.text, 3) [1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung…

r text-mining tm term-document-matrix

asked Jul 28 '17 at 17:27

Bakaburg

votes

1 answer

R Text Mining - Converting Term Document Matrix

I created a list of bigrams using: BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) tdm_a.bigram = TermDocumentMatrix(docs_a, control = list(tokenize = BigramTokenizer)) I am trying to…

r text-mining tm term-document-matrix rweka

asked Jul 07 '17 at 15:23

Sir Oliver
