Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. These matrices are widely used in natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
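The two-document example above can be reproduced with a short Python sketch. It is illustrative only and not taken from the source: the helper names (`tokenize`, `document_term_matrix`, `tf_idf`) are hypothetical, tokenization is assumed to be lowercase plus whitespace splitting, and the weighting shown is just one common tf-idf variant (raw count times log(N/df)); libraries typically apply different smoothing and normalization.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines usually do more.
    return text.lower().split()


def document_term_matrix(docs, vocabulary=None):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    if vocabulary is None:
        # Default column order: terms in order of first appearance.
        vocabulary = list(dict.fromkeys(t for doc in docs for t in tokenize(doc)))
    # Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


def tf_idf(matrix):
    """Reweight raw counts as count * log(N / df), one simple tf-idf variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # df[j] = number of documents in which term j occurs at least once.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for row in matrix
    ]


docs = ["I like databases", "I hate hate databases"]
# The column order is fixed here so that it matches the table above.
vocab, dtm = document_term_matrix(docs, vocabulary=["i", "like", "hate", "databases"])
weighted = tf_idf(dtm)

print(vocab)
for name, row in zip(["D1", "D2"], dtm):
    print(name, row)  # D1 [1, 1, 0, 1] and D2 [1, 0, 2, 1]
```

Under this particular weighting, terms that occur in every document ("i" and "databases") receive weight zero, while "hate" in D2 receives the largest weight because it is both frequent in D2 and absent from D1.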

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions

votes

0 answers

missing words in tdm, using konlp, R

I'm currently preprocessing a Korean corpus using KoNLP in R. library(stringr) library(tm) library(KoNLP) library(dplyr) library(rJava) useNIADic() myfunc_extract <- function(doc){ doc <- as.character(doc) doc2 <- paste(SimplePos22(doc)) …

r corpus term-document-matrix korean-nlp

asked Oct 12 '22 at 01:23

K.K.SAN

votes

0 answers

topicmodels has inverted functions $topics and $terms. Is it reliable?

I have a vector of strings (which represent preprocessed documents) on which I want to estimate an LDA model through R. I use functions in the topicmodels library. For the purpose of making reproduction of the problem easy, I create a vector with…

r lda term-document-matrix topicmodels

asked Aug 25 '22 at 18:33

Thomas GF

votes

1 answer

How to create an efficient term-document matrix from bag-of-words dataset

I am experimenting with UCI Bag of Words Dataset. I have read document IDs, words (word IDs), and word counts into three separate lists. The first 10 items of those lists are similar to what is below: ['1', '1', '1', '1', '1', '2', '2', '2', '3',…

python pandas term-document-matrix

asked May 21 '22 at 09:25

Find Mind

votes

1 answer

Sparse Matrix as a result of crossprod of sparse matrices

I have been working around this problem for a while without finding a satisfactory solution. I have data in a binary sparse matrix (TermDocumentMatrix) with dim ([1] 340436 763717). I here use an extract as proof of concept: m = structure(list(i =…

r sparse-matrix term-document-matrix

asked Apr 22 '22 at 17:04

KArrow'sBest

votes

1 answer

row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency

My question is straightforward. I have a (binary) TDM and I want to reduce the number of rows to include only those rows that appear in at least two documents. I thought that these two methods would produce the same result in a binary matrix: >…

r bigdata sparse-matrix tm term-document-matrix

asked Apr 22 '22 at 10:29

KArrow'sBest

votes

1 answer

PySpark UDF: a fit transform example

I am really new to PySpark and am trying to translate some Python code into PySpark. I start with a pandas DataFrame, convert it to a document-term matrix, and then apply PCA. The UDF: class MultiLabelCounter(): def __init__(self, classes=None): …

pyspark user-defined-functions term-document-matrix

asked Mar 18 '22 at 14:15

laila

votes

1 answer

Complex structure of Term-Document Matrix

I am quite new to R, so sorry if my question is trivial. I am trying to work with word clouds. The function comparison.cloud is supposed to accept a term-document matrix of word frequencies, built like this: head(term.matrix,1) …

r matrix term-document-matrix

asked Feb 14 '22 at 12:35

jback

votes

1 answer

TermDocumentMatrix Error after Cleaning Corpus

My problem is that I want to pass my corpus to the tm function TermDocumentMatrix() and it fails with the error: Error in UseMethod("meta", x): no applicable method for 'meta' applied to an object of class "character". To begin with, I have a…

r tm corpus term-document-matrix

asked Dec 15 '21 at 12:35

Mauras

votes

1 answer

How can I prevent words with hyphens from being tokenized when using scikit-learn's term document matrix?

I am currently working with a large corpus of articles (around 205 thousand), which requires the construction of a term document matrix. I have looked around and it seems that sklearn offers an efficient way to construct it. However, when applying…

python scikit-learn nlp term-document-matrix

asked Oct 29 '21 at 13:40

Thomas GF

votes

1 answer

R: Converting Tibbles to a Term Document Matrix

I am using the R programming language. I learned how to take pdf files from the internet and load them into R. For example, below I load 3 different books by Shakespeare into R: library(pdftools) library(tidytext) library(textrank) library(tm) #1st…

r text nlp text-mining term-document-matrix

asked Apr 09 '21 at 06:21

stats_noob

votes

1 answer

Find frequency of specific words for individual documents in corpus - R, TermDocumentMatrix, TM

For a research project I am working on, I have read pdf documents into R, created a corpus and a TermDocumentMatrix. I want to check the frequency of specific words in each document in my corpus. The code below gives me the kind of matrix I want,…

r tm corpus word-frequency term-document-matrix

asked Jul 08 '20 at 03:52

Sarah R Hall

votes

1 answer

How to remove both Roman numbers and Arabic numbers in TermDocumentMatrix()?

In TermDocumentMatrix(), the parameter removeNumbers=TRUE removes Arabic numbers in an English corpus. How can I remove both Roman numerals (such as "iii", "xiv" and "xiii", in any case) and Arabic numbers? What custom function can I provide…

r tm term-document-matrix

asked May 25 '20 at 19:31

Tim

votes

1 answer

Applying LSA on a term document matrix when the number of documents is very small

I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The columns represent counts for different words in the vocabulary set. I want to get the cosine…

numpy nlp svd term-document-matrix lsa

asked Apr 12 '20 at 02:17

Parth

votes

1 answer

The 'dictionary' parameter of TermDocumentMatrix does not work in R

Even though I added the keyword to 'dictionary' as in the code below, it doesn't extract from the sentence. Sample code library(tm) data = c('a', 'a b', 'c') keyword = c('a', 'b') data = VectorSource(data) corpus = VCorpus(data) tdm =…

r text-mining term-document-matrix

asked Sep 11 '19 at 07:39

pss

votes

1 answer

Is there any possibility to cut a long vector of output into specific pieces and save them in different cells in Excel?

I just started to use Python. Actually, I'm setting up a new methodology to read patent data. This patent data should be analyzed with TextRazor. I'm interested in getting the topics and saving them in a term-document matrix. It's already possible…

python vector term-document-matrix

asked Aug 22 '19 at 08:02

Dogan Kirhan
