Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
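The example above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the helper names `build_dtm` and `tfidf` are hypothetical, tokenization is a plain whitespace split, the column order is passed in explicitly so that it matches the table, and the weighting shown (count times log(N / df)) is just one of several tf-idf variants.

```python
import math
from collections import Counter


def build_dtm(docs, vocabulary):
    """Return one row of raw term counts per document (hypothetical helper)."""
    # Assumption: whitespace tokenization with case preserved.
    counts = [Counter(doc.split()) for doc in docs]
    # Counter returns 0 for terms that do not occur in a document.
    return [[c[term] for term in vocabulary] for c in counts]


def tfidf(matrix):
    """Reweight raw counts as count * log(N / df), one tf-idf variant among many."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for row in matrix
    ]


docs = ["I like databases", "I hate hate databases"]
vocabulary = ["I", "like", "hate", "databases"]  # same column order as the table

dtm = build_dtm(docs, vocabulary)
weighted = tfidf(dtm)

for name, row in zip(["D1", "D2"], dtm):
    print(name, row)
```

With this particular weighting, terms that occur in every document ("I" and "databases") get weight 0, while "like" and "hate" keep a positive weight, which is the usual motivation for moving from raw counts to tf-idf.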

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
2
votes
0 answers

Non-English Term-Document Matrix

I want to construct a term-document matrix in Python for the Arabic language. I used CountVectorizer(), but it gives the documents in the columns and the terms in the rows. I want the terms to be in columns and the documents in rows. I tried to transpose the…
2
votes
1 answer

Normalizing bag of words data in Gensim

I am using gensim to create a bag of words model and I want to perform normalization. I found the documentation (https://radimrehurek.com/gensim/models/normmodel.html), but I am confused as to how to implement that given the code I have.…
Jane Sully
2
votes
3 answers

Create Frequency table using R and Term document Matrix

I have created the following dataframe consisting of a few e-mail subject lines. df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone', 'Offer ! Buy New phone and get earphone at 1000. Limited…
Raghavan vmvs
2
votes
1 answer

TermDocumentMatrix in R - only 1-grams created

I just started with tm package in R and cannot seem to overcome an issue. Even though my tokenizer functions seem to work right: uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1)) biTokenizer <- function(x) NGramTokenizer(x,…
2
votes
1 answer

Text mining Error- Getting this error while creating the DocumentTermMatrix & Word Cloud

I am getting the error message 'Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus), : 'i, j' invalid' while creating the DocumentTermMatrix or a word cloud. This is happening in all data sets. Here is the…
2
votes
1 answer

Converting a Spark dataframe to Term document matrix in R using sparklyr

I have R code that needs to be scaled to handle big data. I am using Spark for this, and the package that seemed most convenient was sparklyr. However, I am unable to create a TermDocument matrix from a Spark dataframe. Any help would be…
NinjaR
2
votes
1 answer

tm_map(gsub...) fails to replace words

# Loading required libraries # Set up logistics such as reading in data and setting up corpus ```{r} # Relative path points to the local folder folder.path="../data/InauguralSpeeches/" # get the list of file names speeches=list.files(path =…
user101998
2
votes
1 answer

R: DocumentTermMatrix Wrong Frequencies after mgsub

I have a DocumentTermMatrix and I'd like to replace specific terms in this document and create a frequency table. The starting point is the original document as follows: library(tm) library(qdap) df1 <- data.frame(word =c("test", "test",…
OAM
2
votes
2 answers

R: Natural Language Processing on Support Vector Machine-TermDocumentMatrix

I have started working on a project which requires Natural Language Processing and building a model on Support Vector Machine (SVM) in R. I’d like to generate a Term Document Matrix with all the tokens. Example: testset <- c("From month 2 the AST…
2
votes
2 answers

Importing a TermDocumentMatrix into R

I am working on a qualitative analysis project in the tm package of R. I have built a corpus and created a term-document matrix and, long story short, I need to edit my term-document matrix and conflate some of its rows. To do this I have exported it…
lrampe
2
votes
1 answer

R construct document term matrix how to match dictionaries whose values consist of white-space separated phrases

When doing text mining in R, after reprocessing the text data, we need to create a document-term matrix for further exploration. But, as in Chinese, English also has certain phrases, such as "semantic distance" and "machine learning"; if you…
Fiona_Wang
2
votes
1 answer

"Difference" among Document Term Matrices

Suppose I have a set of 100 documents, 70 about politics and 30 about math (a weird combination, I know). My goal is to represent them on an xy plane through methods like multidimensional scaling, network analysis, SOM, etc.…
Andrea Ianni
2
votes
1 answer

How to get the Term-Doc frequency from many fields combined?

I have written an index with lucene, from a collection of documents. My documents have 2 fields and were added to the index like so: Document doc = new Document(); doc.add(new TextField("Title", "I am a title", Field.Store.NO)); doc.add(new…
dimitris93
2
votes
2 answers

R How do I keep punctuation with TermDocumentMatrix()

I have a large dataframe where I am identifying patterns in strings and then extracting them. I have provided a small subset to illustrate my task. I am generating my patterns by creating a TermDocumentMatrix with multiple words. I use these…
CallumH
2
votes
1 answer

In R plotting keyword / word associations (findAssocs) with igraph on tdm or dtm in R?

I'd like to create a term network analysis plot based on certain word associations in R, but I don't know how to go beyond plotting an entire Term Document Matrix: # Network analysis library(igraph) # load tdm data # create matrix Neg.m <-…
Robert