Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.

Source: http://en.wikipedia.org/wiki/Document-term_matrix
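
The D1/D2 example above can be reproduced with a short Python sketch. It is an illustration only, not how any particular library builds its matrix: the function names `document_term_matrix` and `tf_idf` are hypothetical, tokenization is a naive whitespace split, the column order is passed in by hand to match the table, and the tf-idf variant shown (raw count times log(N / document frequency)) is just one of several in common use.

```python
import math
from collections import Counter


def document_term_matrix(docs, vocabulary):
    """Count how often each vocabulary term occurs in each document."""
    rows = []
    for doc in docs:
        # Naive whitespace tokenization (an assumption for this sketch).
        counts = Counter(doc.split())
        # Counter returns 0 for terms that do not occur in the document.
        rows.append([counts[term] for term in vocabulary])
    return rows


def tf_idf(matrix):
    """Reweight raw counts as count * log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_terms)]
        for row in matrix
    ]


docs = ["I like databases", "I hate hate databases"]
vocabulary = ["I", "like", "hate", "databases"]  # column order of the table

dtm = document_term_matrix(docs, vocabulary)
print(dtm)  # [[1, 1, 0, 1], [1, 0, 2, 1]]

weighted = tf_idf(dtm)
for name, row in zip(["D1", "D2"], weighted):
    print(name, [round(w, 3) for w in row])
```

Under this weighting, terms that occur in every document ("I" and "databases") get weight 0, while "like" and "hate" keep a positive weight only in the document that contains them.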

152 questions
1 vote, 1 answer

R: TermDocumentMatrix - Error while creating

I am trying to get Twitter data and create a word cloud, but my code gives an error while creating the TermDocumentMatrix. My code is as below: twitter_search_data <- searchTwitter(searchString = text_to_search ,n =…
Main
  • 150
  • 1
  • 10
1 vote, 0 answers

Text analytics in R

I have a large data set (460 MB) which has a column, Log, with 386551 rows. I wish to use clustering and an n-gram approach to form a word cloud. My code is as follows: library(readr) AMC <- read_csv("All Tickets.csv") Desc <- AMC[,4] #Very large data…
Jason Born
  • 39
  • 7
1 vote, 1 answer

Calculate word frequency in DataFrame

I am trying to create a dataframe where the first column ("Value") has a multi-word string in each row and all other columns have labels representing unique words from all strings in "Value". I want to populate this dataframe with the word frequency…
Toly
  • 2,981
  • 8
  • 25
  • 35
1 vote, 1 answer

calculate term document matrix while looking for words within strings also

This question is related to my earlier question, "Treat words separated by space in the same manner". I am posting it as a separate one since it might help other users find it easily. The question is regarding the way the term document matrix is…
user3664020
  • 2,980
  • 6
  • 24
  • 45
1 vote, 1 answer

Exclude outliers in colSums for Term Document Matrix in R

I created a Term Document Matrix, "myDtm", of a set of keywords contained in a large collection of patents. I want to obtain an ordered list, a kind of Top 100, of the patents with the highest frequency of keywords. The code lines are myDtm <-…
1 vote, 3 answers

Creating a document term matrix in R

I need to create a document-term matrix for myself, my Twitter followers, and their followers. We need to create this without using the tm package. At the moment, we have the following variables: list l : containing all the followers' followers,…
Olivier Thierie
  • 161
  • 2
  • 11
1 vote, 1 answer

TermDocumentMatrix as.matrix uses large amounts of memory

I'm currently using the tm package to extract terms to cluster on for duplicate detection in a decently sized database of 25k items (30Mb). This runs on my desktop, but when I try to run it on my server it seems to take an ungodly amount of time.…
Matt Bucci
  • 2,100
  • 2
  • 16
  • 22
1 vote, 1 answer

R build TermDocumentMatrix with removeSparseTerms parameter

Am I able to remove sparse terms WHILE creating a tm::TermDocumentMatrix object? I tried: TermDocumentMatrix(file.corp, control = list(removeSparseTerms=0.998)) but it does not work.
Marta Karas
  • 4,967
  • 10
  • 47
  • 77
1 vote, 1 answer

R DocumentTermMatrix loses results less than 100

I'm trying to feed a corpus into DocumentTermMatrix (I shorthand as DTM) to get term frequencies, but I noticed that DTM doesn't keep all terms and I don't know why! Check it out: A<-c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106…
Amit Kohli
  • 2,860
  • 2
  • 24
  • 44
1 vote, 0 answers

Converting a term document matrix into a tableau readable table

I have created a term document matrix using R tm package and exported it into a csv by converting it into a dataframe. Sample portion of the term document matrix: 1 10 12 14 15 16 17 century 0 4 0 0 1 5 3 pete 0 2 …
koder
  • 81
  • 3
  • 9
1 vote, 1 answer

Making a term document matrix from an excel file using R

For sentiment analysis using tm plugin webmining, I am to create a TermDocumentMatrix, as shown in the code sample below: http://www.inside-r.org/packages/cran/tm/docs/tm_tag_score I have a csv file with headlines of articles on separate rows, in a…
user2976990
  • 21
  • 1
  • 3
1 vote, 1 answer

R : Finding the top 10 terms associated with the term 'fraud' across documents in a Document Term Matrix in R

I have a corpus of 39 text files named by the year - 1945.txt, 1978.txt.... 2013.txt. I've imported them into R and created a Document Term Matrix using the tm package. I'm trying to investigate how words associated with the term 'fraud' have changed over…
koder
  • 81
  • 3
  • 9
1 vote, 2 answers

Creating a Term Document Matrix from Text File

I'm trying to read one text file and create a term document matrix using text mining packages. I can create a term document matrix, but only by adding the text line by line. The problem is that I want to include the whole file at once. What am I missing in…
J4cK
  • 30,459
  • 8
  • 42
  • 54
1 vote, 1 answer

What is the significance of covariance matrix constructed through term document matrix in PCA?

I'm working on neural networks, and to reduce the dimensions of the term-document matrix, which is constructed from the documents and their terms and holds tf-idf values, I need to apply PCA. Something like this: Term 1 …
Hooli
  • 711
  • 2
  • 13
  • 24
1 vote, 2 answers

MATLAB nnmf() - large term-document matrix - memory and speed issue

I have a large term-document matrix and want to use the non-negative matrix factorization function MATLAB offers. The problem is that after the 1st iteration the memory usage rises rapidly and reaches the limit (my system has 6GB), and on the other…
tgogos
  • 23,218
  • 20
  • 96
  • 128