Questions tagged [text2vec]

text2vec is an R package that provides a fast, memory-efficient framework for text mining applications: vectorization, word embeddings, topic modelling and more.

The goal of text2vec is to provide tools to easily perform text mining in R at C++ speeds:

  1. Core parts written in C++
  2. Small memory footprint
  3. Concise, pipe-friendly API
  4. No need to load all data into RAM: process it in chunks
  5. Easy vertical scaling across multiple cores and threads

See the development page on GitHub.

111 questions
0
votes
0 answers

Convert dgeMatrix for downstream tasks

I am trying to cluster sentence embeddings based on the GloVe model from text2vec. I generated the embeddings using the GloVe model like so (I create the iterator, vocab, etc. in the standard way): # create document term matrix dtm = create_dtm(it,…
0
votes
1 answer

error running glmnet on 2 combined DTMs (via cBind) in text2vec

I created a TF-IDF DTM and an n-gram-based DTM in text2vec, using the same dataset. I am able to run glmnet on each of them separately, but when I combine these two DTMs via cBind, glmnet gives me an error: Error in validObject(.Object)…
Akhil
0
votes
1 answer

Sparse matrix in CSC format (dgCMatrix) causes an error in LiblineaR [R]

dtm_train_tfidf is a sparse matrix in CSC format (dgCMatrix). I am using the function LiblineaR, which is supposed to accept sparse matrices. However, when I use the sparse matrix dtm_train_tfidf, the following error occurs: library(LiblineaR) …
toumperlekis
0
votes
1 answer

I have done TF-IDF and want to implement models in the caret package [R]

I have implemented the TF-IDF algorithm that is explained in this link: https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html#tf-idf So, the classifier is implemented like this: glmnet_classifier = cv.glmnet(x =…
toumperlekis
0
votes
1 answer

How to use the prepare_analogy_questions and check_analogy_accuracy functions in the text2vec package?

Following code: library(text2vec) text8_file = "text8" if (!file.exists(text8_file)) { download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip") unzip ("text8.zip", files = "text8") } wiki = readLines(text8_file, n = 1, warn = FALSE) #…
0
votes
1 answer

Text preprocessing and topic modelling using text2vec package

I have a large number of documents and I want to do topic modelling using text2vec and LDA (Gibbs sampling). The steps I need are (in order): Removing numbers and symbols from the text library(stringr) docs$text <-…
Sam S.
0
votes
1 answer

In the R text2vec package, can the LDA model show the topic distribution for each token in a document?

library (text2vec) library (parallel) library (doParallel) N <- parallel::detectCores() cl <- makeCluster (N) registerDoParallel (cl) Ky_young <- read.csv("./Ky_young.csv") IT <- itoken_parallel (Ky_young$TEXTInfo, ids …
유승환
0
votes
1 answer

The compatibility between text2vec and RHadoop

At present, we are using text2vec to process a large dataset on AWS EC2 (single instance). The text data will keep growing in the future, so we may try an RHadoop (MapReduce) architecture, but we don't know whether there is compatibility between text2vec and…
Zheng Lu
0
votes
1 answer

TM, Quanteda, text2vec. Get strings on the left of term in wordlist according to regex pattern

I would like to analyse a big folder of texts for the presence of names, addresses and telephone numbers in several languages. These will usually be preceded by a word such as "Address", "telephone number", "name", "company", "hospital", "deliverer". I…
Jacek Kotowski
0
votes
2 answers

How to produce a document-term matrix in text2vec only from a stored list of words

What is the syntax in text2vec to vectorize texts and obtain a DTM with only the indicated list of words? How to vectorize and produce a document-term matrix only on the indicated features? And if the features do not appear in the text the variable should…
Jacek Kotowski
0
votes
1 answer

Text2Vec classification with caret - Naive Bayes warning message

Please see the question listed here for more context. I am attempting to use a document-term matrix, built using text2vec, to train a Naive Bayes (nb) model using the caret package. However, I get this warning message: Warning message: In eval(xpr,…
UbuntuNewbie
0
votes
1 answer

Text2Vec classification with caret SVM warning message

I am working on a text classification problem with the text2vec package and caret. I am using text2vec to build a document-term matrix before building different models with caret. The goal is to identify string similarity between two strings, using…
UbuntuNewbie
0
votes
0 answers

text2vec tfidf fails in R with odd message

I encountered an odd issue when trying to use tf-idf on my corpus. Here is my code: prep_fun <- function(x) { x %>% # make text lower case str_to_lower %>% # remove non-alphanumeric symbols str_replace_all("<.*?>", " ")…
Zakkery
0
votes
1 answer

Plotting the effect of document pruning on text corpus in R text2vec

Is it possible to check how many documents remain in the corpus after applying prune_vocabulary in the text2vec package? Here is an example of loading a dataset and pruning the vocabulary: library(text2vec) library(data.table) library(tm) #Load…
sriramn
0
votes
1 answer

Compute unweighted bag-of-words based TCM using text2vec in R?

I am trying to compute a term-term co-occurrence matrix (or TCM) from a corpus using the text2vec package in R (since it has a nice parallel backend). I followed this tutorial, but while inspecting some toy examples, I noticed the create_tcm…
user3554004