Questions tagged [text2vec]

text2vec is an R package that provides a fast and memory-efficient framework for text mining applications within R: vectorization, word embeddings, topic modelling and more.

The goal of text2vec is to provide tools to easily perform text mining in R with C++ speeds:

  1. Core parts written in C++
  2. Small memory footprint
  3. Concise, pipe friendly API
  4. No need to load all data into RAM; process it in chunks
  5. Easy vertical scaling with multiple cores and threads

See the development page on GitHub.

111 questions
1
vote
1 answer

text2vec - Do topics' words update with new data?

I'm currently performing topic modelling using LDA from the text2vec package. I managed to create a dtm matrix and then apply LDA and its fit_transform method with n_topics=50. While looking at the top words from each topic, a question popped into my…
1
vote
1 answer

tokenizing a list doesn't work with UTF8

I extract some data from an Oracle DB to do some text mining. My data is UTF-8 and vocab can't handle it.…
parvij
1
vote
1 answer

LDA$new model constructor text2vec R package error: Error in .subset2(public_bind_env, "initialize")(...) : unused argument (...)

The error is: > lda_model = LDA$new(n_topics = 3, vocabulary = vocab, doc_topic_prior = 0.1, topic_word_prior = 0.01) Error in .subset2(public_bind_env, "initialize")(...) : unused argument (vocabulary = list(term = c("normal", "bobo", "lixo",…
1
vote
1 answer

Lemmatization using txt file with lemmes in R

I would like to use external txt file with Polish lemmas structured as follows: (source for lemmas for many other languages http://www.lexiconista.com/datasets/lemmatization/) Abadan Abadanem Abadan Abadanie Abadan Abadanowi Abadan …
Jacek Kotowski
1
vote
2 answers

Why do I get two different performances when creating Jaccard similarity matrix using two sparse matrices that seem to be the same

I'm confounded by a strange performance issue when I try to create a Jaccard similarity matrix using sim2() from the text2vec package. I have a sparse matrix [210,000 x 500] for which I'd like to obtain the Jaccard similarity matrix as mentioned above. When…
Ankhnesmerira
1
vote
1 answer

R: how to add numeric variables to a sparse matrix?

Consider the following example library(text2vec) library(glmnet) library(dplyr) dataframe <- data_frame(id = c(1,2,3,4), text = c("this is a test", "this is another",'hello','what???'), value =…
ℕʘʘḆḽḘ
1
vote
1 answer

Can text2vec package split Chinese sentence?

How do I set itoken in text2vec for splitting Chinese sentences? The example is for English! There are existing Chinese word segmentation packages, such as jieba. However, I want to use text2vec to do text clustering and LDA modelling. In addition, how to do text…
cindy
1
vote
1 answer

How to get topic probability table from text2vec LDA

The LDA topic modeling in the text2vec package is amazing. It is indeed much faster than topicmodel. However, I don't know how to get the probability that each document belongs to each topic, as in the example below: V1 V2 V3 V4 1 0.001025237…
Lucia
1
vote
1 answer

Write a text2vec dtm to a file (csv or svmlight)

I came across the text2vec package today and it's exactly what I need for a particular problem. However, I haven't been able to figure out how to export a dtm created with text2vec to some kind of output file. My ultimate goal is to generate…
Dave Kincaid
1
vote
2 answers

text2vec: Iterate over the vocabulary after using function create_vocabulary

Using the text2vec package, I created a vocabulary. vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) vocab looks something like this > vocab Number of docs: 120 0 stopwords: ... ngram_min = 2; ngram_max = 2 Vocabulary: terms…
Hardik Gupta
1
vote
1 answer

text2vec in R- Transform new data?

There is documentation on creating a DTM (document term matrix) for the text2vec package, for example the following where a TFIDF weighting is applied after building the matrix: data("movie_review") N <- 1000 it <- itoken(movie_review$review[1:N],…
B_Miner
0
votes
0 answers

Including a covariate in a word embedding model in R using text2vec and quanteda packages

I am trying to build a word embedding model in R with the following code: library(quanteda) library(text2vec) fcm_ <- fcm(tokens, context = "window", count = "weighted", weights = 1 / (1:5), tri = TRUE) glove <- GlobalVectors$new(rank = 50, x_max…
0
votes
1 answer

How can I hide messages in R markdown when "message=FALSE" doesn't work

I am using R Markdown and text2vec and would like to suppress the messages that come from running the function glove$fit_transform(). I've tried message=FALSE and warning=FALSE, as well as a number of hacky attempts at fixing the problem, but to no…
generic
0
votes
1 answer

How can I solve my problems with the installation of the text2vec package?

I'm trying to install the R package text2vec, but I get the following error. It says it cannot open a certain shared object file. > install.packages("text2vec") Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared object…
Nina van Bruggen
0
votes
0 answers

Viewing saved LDAvis plot from directory in browser

I created an LDAvis figure using the text2vec package in R. I tried but failed to save it to my local directory as the fully interactive webpage that it is. I get either a blank page in my browser or a static one when I save the figure with the following…
nigus21