Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
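The count matrix above can be reproduced with a minimal sketch in plain Python. The function name `document_term_matrix` and its optional `terms` argument are hypothetical, chosen for illustration. The sketch assumes whitespace tokenization with no lowercasing, stemming, or stopword removal, which real text-mining packages typically apply.

```python
from collections import Counter


def document_term_matrix(docs, terms=None):
    """Return (terms, matrix) where matrix[d][t] counts term t in document d.

    Hypothetical helper: tokens are split on whitespace only.
    If `terms` is omitted, the vocabulary is built in order of first appearance.
    """
    counts = [Counter(doc.split()) for doc in docs]
    if terms is None:
        terms = []
        for doc in docs:
            for token in doc.split():
                if token not in terms:
                    terms.append(token)
    # Counter returns 0 for tokens a document does not contain.
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


docs = ["I like databases", "I hate hate databases"]
# Fix the column order to match the table in the text.
terms, dtm = document_term_matrix(docs, terms=["I", "like", "hate", "databases"])

for name, row in zip(["D1", "D2"], dtm):
    print(name, row)
# D1 [1, 1, 0, 1]
# D2 [1, 0, 2, 1]
```

Transposing the result (for example with `list(zip(*dtm))`) gives the corresponding term-document matrix, with terms as rows and documents as columns.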

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
0
votes
1 answer

R: import pdf and create TermDocumentMatrix with files names as id

I'm importing pdf to R in order to do some text analysis. I have a number of pdf files whose names are their publication year (one publication per year). I would like to create a TermDocumentMatrix after importing them for which the first term…
0
votes
1 answer

R convert dataframe to term-document-matrix

I'm currently learning my way around R and I'm troubled by the following problem: I've got a dataframe that is built up like this word freq1 freq2 tree 10 20 this 2 3 that 4 …
0
votes
1 answer

structure of DTM

> str(dtm) List of 6 $ i : int [1:128403] 1 1 1 1 1 1 1 1 1 1 ... $ j : int [1:128403] 1 2 3 4 5 6 7 8 9 10 ... $ v : num [1:128403] 1 1 1 1 2 1 1 1 1 1 ... $ nrow : int 500 $ ncol : int 8330 Could anyone please tell…
0
votes
0 answers

Remove the most and the least appearing terms from a Term Document Matrix in R

I'm reading a Korean text file and trying to remove the most frequent terms (stopwords) and the least frequent terms from a Term Document Matrix which is generated in R. From the code below I'm able to get the TDM, but it has weights for all the…
Kailash Sharma
  • 51
  • 1
  • 2
  • 7
0
votes
1 answer

build term document matrix from PDF file

I am trying to build a term document matrix from the text of one PDF. When I inspect the term document matrix, I get this. <> The number of documents should be 1, not 342; 342 is the number of pages in the PDF file.…
0
votes
0 answers

How to remove empty documents from a Term-Document-Matrix in R

So I created a term document matrix from a corpus in R: tdm_tfidf <-TermDocumentMatrix(corpus,control=list(weighting=weightTfIdf)) However, there is a warning that the TDM contains empty documents: Warning: In weighting(x) : empty document(s): 54…
Lucinho91
  • 175
  • 2
  • 4
  • 16
0
votes
0 answers

Adding RegEx to specify character ngrams for a corpus in R

I'm having trouble using a RegEx on a corpus. I read in a couple of text documents that I converted to a corpus. I want to display it in a TermDocumentMatrix after some pre-processing. First I want to specify them with the RegEx "(\b([a-z]*)\B)".…
J.B.
  • 13
  • 6
0
votes
1 answer

Word Term Matrix

I would love to create a Word matrix from some tweets, each word from the tweet has to be a new variable and be filled with 1 for only the words that correspond to that text in the tweet x <- data.frame("Tweet" = c("hi all","I need help"), "N" = 1,…
Suanbit
  • 471
  • 1
  • 4
  • 12
0
votes
0 answers

Creating a Matrix with DocumentTermMatrix from the strings in a column in R (text-mining)

The dataset looks as follows: sentiment<-c(1, 1, 0) review<-c("review1", "review2", "review3") #insert some reviews here daamx<-data.frame(sentiment, review) The sentiment column is a value that says if the review is positive or negative (1 or 0),…
VBApros
  • 1
  • 1
0
votes
2 answers

Why my Term Document Matrix has letters missing at end?

I'm working on creating a word cloud. On creation I see many words with their last letters missing. For example, Movie --> movi, become --> becom. I've marked the words in yellow. The last one or two letters are missing.
0
votes
1 answer

Find top features by Id(Contains Multiple documents with same Id) from DTM

I am using the tm package. I have a dataframe with 2 columns; the first column is ID and the second column contains text. The dataframe looks as follows. Id Text 13456 Hi, Good morning 13457 How are you? 13456 May I know who I am speaking…
Bhavya
  • 3
  • 2
0
votes
1 answer

Extract top features by frequency per document from a dtm in R

I have a dtm and want to extract the top 5 terms by frequency for each document from the document term matrix. I have a dtm built using the tm package Terms Docs aaaa aac abrt abused accept accepted 1 0 0 0 0 0 0 2 0 0 0 0 0…
Bhavya
  • 3
  • 2
0
votes
1 answer

How to tokenize text using punctuation as boundaries (Python)

I'm using CountVectorizer from sklearn to do text tokenization (2-gram) and create a term-document matrix. How can I tokenize text into 2-grams with punctuation as boundaries? For example, the input sentence is "this is example, with punctuation." I…
Yichi Liu
  • 23
  • 1
  • 5
0
votes
1 answer

TM DocumentTermMatrix gives results which are unexpected given the corpus

Maybe I misinterpret how tm::DocumentTermMatrix works. I have a corpus which after preprocessing looks like this: head(Description.text, 3) [1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung…
Bakaburg
  • 3,165
  • 4
  • 32
  • 64
0
votes
1 answer

R Text Mining - Converting Term Document Matrix

I created a list of bigrams using: BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) tdm_a.bigram = TermDocumentMatrix(docs_a, control = list(tokenize = BigramTokenizer)) I am trying to…
Sir Oliver
  • 57
  • 8