Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. These matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.

Source: http://en.wikipedia.org/wiki/Document-term_matrix
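
The two-document example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `document_term_matrix` is hypothetical, tokens are split on whitespace with no case or punctuation handling, and the column order is passed in explicitly so that it matches the example.

```python
from collections import Counter


def document_term_matrix(docs, vocabulary):
    """Return one row of raw term counts per document.

    Hypothetical helper for illustration: rows follow the order of `docs`,
    columns follow the order of `vocabulary`.
    """
    rows = []
    for doc in docs:
        # Naive whitespace tokenization; real pipelines usually normalize text first.
        counts = Counter(doc.split())
        # Counter returns 0 for terms that do not occur in the document.
        rows.append([counts[term] for term in vocabulary])
    return rows


docs = ["I like databases", "I hate hate databases"]
vocabulary = ["I", "like", "hate", "databases"]
matrix = document_term_matrix(docs, vocabulary)

for name, row in zip(["D1", "D2"], matrix):
    print(name, row)
```

Each row holds the raw counts for one document. Transposing the result gives the term-document orientation, and a weighting such as tf-idf would replace the raw counts with rescaled values.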

152 questions
0
votes
0 answers

missing words in tdm, using konlp, R

I'm currently preprocessing a Korean corpus using KoNLP in R. library(stringr) library(tm) library(KoNLP) library(dplyr) library(rJava) useNIADic() myfunc_extract <- function(doc){ doc <- as.character(doc) doc2 <- paste(SimplePos22(doc)) …
K.K.SAN
  • 11
  • 4
0
votes
0 answers

topicmodels has inverted functions $topics and $terms. Is it reliable?

I have a vector of strings (which represent preprocessed documents) on which I want to estimate an LDA model through R. I use functions in the topicmodels library. For the purpose of making reproduction of the problem easy, I create a vector with…
Thomas GF
  • 1
  • 2
0
votes
1 answer

How to create an efficient term-document matrix from bag-of-words dataset

I am experimenting with UCI Bag of Words Dataset. I have read document IDs, words (word IDs), and word counts into three separate lists. The first 10 items of those lists are similar to what is below: ['1', '1', '1', '1', '1', '2', '2', '2', '3',…
0
votes
1 answer

Sparse Matrix as a result of crossprod of sparse matrices

I have been working around this problem for a while without finding a satisfactory solution. I have data in a binary sparse matrix (TermDocumentMatrix) with dim ([1] 340436 763717). I here use an extract as proof of concept: m = structure(list(i =…
KArrow'sBest
  • 150
  • 9
0
votes
1 answer

row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency

My question is straightforward. I have a (binary) TDM and I want to reduce the number of rows to include only those terms that appear in at least two documents. I thought that these two methods would produce the same result in a binary matrix: >…
KArrow'sBest
  • 150
  • 9
0
votes
1 answer

PySpark UDF: a fit transform example

I am really new to PySpark and am trying to translate some python code into pyspark. I start with a panda, convert to a document - term matrix and then apply PCA. The UDF: class MultiLabelCounter(): def __init__(self, classes=None): …
laila
  • 1,009
  • 3
  • 15
  • 27
0
votes
1 answer

Complex structure of Term-Document Matrix

I am quite new to R, sorry if my question is trivial. I am trying to work with word clouds. The function comparison.cloud is supposed to accept a term-document matrix of word frequencies built like this: head(term.matrix,1) …
jback
  • 11
  • 1
0
votes
1 answer

TermDocumentMatrix Error after Cleaning Corpus

My problem is that I want to pass my corpus to the tm function TermDocumentMatrix() and it fails with the error: Error in UseMethod("meta", x): no applicable method for 'meta' applied to an object of class "character". To begin with, I have a…
Mauras
  • 1
  • 2
0
votes
1 answer

How can I prevent words with hyphens from being tokenized when using scikit-learn's term document matrix?

I am currently working with a large corpus of articles (around 205 thousand), which require the construction of a term document matrix. I have looked around and it seems that sklearn offers an efficient way to construct it. However, when applying…
Thomas GF
  • 1
  • 2
0
votes
1 answer

R: Converting Tibbles to a Term Document Matrix

I am using the R programming language. I learned how to take pdf files from the internet and load them into R. For example, below I load 3 different books by Shakespeare into R: library(pdftools) library(tidytext) library(textrank) library(tm) #1st…
stats_noob
  • 5,401
  • 4
  • 27
  • 83
0
votes
1 answer

Find frequency of specific words for individual documents in corpus - R, TermDocumentMatrix, TM

For a research project I am working on, I have read pdf documents into R, created a corpus and a TermDocumentMatrix. I want to check the frequency of specific words in each document in my corpus. The code below gives me the kind of matrix I want,…
0
votes
1 answer

How to remove both Roman numbers and Arabic numbers in TermDocumentMatrix()?

In TermDocumentMatrix(), parameter removeNumbers=TRUE removes Arabic numbers in an English corpus. How can I remove both Roman numerals (such as "iii", "xiv" and "xiii", and in any case) and Arabic numbers? What custom function can I provide…
Tim
  • 1
  • 141
  • 372
  • 590
0
votes
1 answer

Applying LSA on a term document matrix when the number of documents is very small

I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The column represents counts for different words in the vocabulary set. I want to get the cosine…
Parth
  • 2,682
  • 1
  • 20
  • 39
0
votes
1 answer

The 'dictionary' parameter of TermDocumentMatrix does not work in R

Even though I added the keyword to 'dictionary' as in the code below, it is not extracted from the sentence. Sample code library(tm) data = c('a', 'a b', 'c') keyword = c('a', 'b') data = VectorSource(data) corpus = VCorpus(data) tdm =…
pss
  • 3
  • 5
0
votes
1 answer

Is there any possibility to cut a long vector of output into specific pieces and save them in different cells in Excel?

I just started to use Python. Actually, I'm setting up a new methodology to read patent data. With textrazor this patent data should be analyzed. I'm interested in getting the topics and save them in a term-document-matrix. It's already possible…