Questions tagged [text2vec]

text2vec is an R package that provides a fast and memory-efficient framework for text mining applications in R: vectorization, word embeddings, topic modelling, and more.

The goal of text2vec is to provide tools to easily perform text mining in R at C++ speed:

  1. Core parts written in C++
  2. Small memory footprint
  3. Concise, pipe friendly API
  4. No need to load all data into RAM - process it in chunks
  5. Easy vertical scaling across multiple cores and threads

See the development page on GitHub.

111 questions
2
votes
0 answers

Error: attempt to apply non-function in text2vec

I am trying to replicate the example given in the following link https://cran.r-project.org/web/packages/text2vec/vignettes/glove.html. I have unzipped the file manually. I am getting the following error at this stage: library(text2vec) wiki =…
NinjaR
  • 621
  • 6
  • 22
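The vignette linked in this question predates several text2vec API changes, which is a common cause of "attempt to apply non-function" errors when old example code is run against a newer package version. A minimal sketch of the current GloVe pipeline, assuming text2vec >= 0.6 and a toy corpus in place of the wiki dump:

```r
library(text2vec)

# Hypothetical tiny corpus standing in for the wikipedia dump
texts <- c("the cat sat on the mat", "the dog sat on the log")

# Tokenize and build a vocabulary
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Term co-occurrence matrix with a symmetric window of 5 words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# Fit GloVe; recent text2vec versions use `rank` in the constructor
# (older releases used `word_vectors_size` and a `vocabulary` argument)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 10)
word_vectors <- wv_main + t(glove$components)  # main + context vectors
```

Consult the current vignette for your installed version, since the constructor arguments have been renamed across releases.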
2
votes
2 answers

Use a pre-trained model with text2vec?

I would like to use a pre-trained model with text2vec. My understanding is that the benefit here is that these models have already been trained on a huge volume of data, e.g. the Google News model. Reading the text2vec documentation, it looks like the…
Doug Fir
  • 19,971
  • 47
  • 169
  • 299
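text2vec does not ship pre-trained vectors, but any embedding distributed in the plain-text GloVe format (one "word v1 v2 …" row per line) can be loaded into an ordinary matrix and used with text2vec's similarity helpers. A sketch, with a hypothetical file name and dimensionality:

```r
library(text2vec)

# Load pre-trained GloVe vectors from a plain-text file (file name and
# the 50-dimension assumption are illustrative)
lines <- readLines("glove.6B.50d.txt")
parts <- strsplit(lines, " ", fixed = TRUE)
words <- vapply(parts, `[`, character(1), 1L)
vectors <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(50)))
rownames(vectors) <- words

# Cosine neighbours of a query word (assumes "king" is in the vocabulary)
sims <- sim2(vectors, vectors["king", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE))
```

Word2vec-format files have an extra "vocab_size dim" header line that would need to be skipped first.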
2
votes
4 answers

A lemmatizing function using a hash dictionary does not work with tm package in R

I would like to lemmatize Polish text using a large external dictionary (format like in the txt variable below). Unfortunately, Polish is not supported by the popular text mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by…
Jacek Kotowski
  • 620
  • 16
  • 49
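Dictionary-based lemmatization can be expressed as a named-vector lookup applied inside a custom tokenizer, which plugs directly into text2vec's itoken (and the same function can be wrapped in tm's content_transformer). A sketch with two illustrative Polish dictionary entries:

```r
library(text2vec)

# Hypothetical dictionary: inflected form -> lemma
lemmas <- c(kotami = "kot", kotom = "kot")

# Custom tokenizer that maps each token through the dictionary
lemma_tokenizer <- function(x) {
  lapply(word_tokenizer(x), function(tokens) {
    hit <- tokens %in% names(lemmas)
    tokens[hit] <- lemmas[tokens[hit]]
    tokens
  })
}

it <- itoken(c("kotami kotom pies"), tokenizer = lemma_tokenizer,
             progressbar = FALSE)
create_vocabulary(it)
```

For a dictionary with millions of entries, named-vector lookup in R uses hashing internally, so per-token cost stays roughly constant.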
2
votes
1 answer

H2O: how to use gradient boosting on textual data?

I am trying to implement a very simple ML learning problem, where I use text to predict some outcome. In R, some basic example would be: import some fake but funny text data library(caret) library(dplyr) library(text2vec) dataframe <- data_frame(id…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
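The usual bridge is to turn the text into a document-term matrix with text2vec and hand that, plus the outcome column, to H2O. A minimal sketch with made-up data (densifying the DTM, which is only reasonable for small vocabularies):

```r
library(text2vec)
library(h2o)

df <- data.frame(id = 1:4,
                 text = c("good movie", "bad movie", "great film", "awful film"),
                 label = c("pos", "neg", "pos", "neg"))

# Bag-of-words DTM via text2vec
it <- itoken(df$text, tokenizer = word_tokenizer, progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it))
dtm <- create_dtm(it, vectorizer)

# Hand the (densified) DTM plus label to H2O's gradient boosting
h2o.init()
hf <- as.h2o(data.frame(as.matrix(dtm), label = as.factor(df$label)))
model <- h2o.gbm(y = "label", training_frame = hf)
```

When x is omitted, h2o.gbm uses all remaining columns as predictors.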
2
votes
1 answer

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having…
user2047457
  • 381
  • 4
  • 13
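The key is to reuse the vectorizer fitted on the training corpus, so new documents map onto the same vocabulary, and then aggregate word vectors per document. A sketch, assuming `vectorizer` and `word_vectors` come from the original GloVe fit:

```r
library(text2vec)

new_docs <- c("new jargon heavy document")
it_new <- itoken(new_docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new <- create_dtm(it_new, vectorizer)

# Keep only terms that actually have an embedding, then take the mean
# word vector per document (rows with no known terms come out as NaN)
common <- intersect(colnames(dtm_new), rownames(word_vectors))
doc_vectors <- as.matrix(dtm_new[, common, drop = FALSE]) %*%
               word_vectors[common, , drop = FALSE] /
               Matrix::rowSums(dtm_new[, common, drop = FALSE])
```

Averaging is only one choice; tf-idf weighting of the rows before multiplying is a common refinement.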
2
votes
1 answer

Efficiently derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The…
user3554004
  • 1,044
  • 9
  • 24
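Since the n-gram files already arrive as aggregated counts, the matrix can be assembled directly from (word1, word2, count) triplets without ever materializing a dense matrix. A sketch using the Matrix package (the triplet data here is made up):

```r
library(Matrix)

# Hypothetical co-occurrence triplets parsed from the n-gram files
pairs <- data.frame(w1 = c("new", "new"),
                    w2 = c("york", "jersey"),
                    n  = c(100, 40))

terms <- sort(unique(c(pairs$w1, pairs$w2)))
tcm <- sparseMatrix(i = match(pairs$w1, terms),
                    j = match(pairs$w2, terms),
                    x = pairs$n,
                    dims = c(length(terms), length(terms)),
                    dimnames = list(terms, terms))
```

sparseMatrix sums duplicate (i, j) entries by default, so the triplets can be streamed in from multiple n-gram shards and combined.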
2
votes
3 answers

Replace words in text2vec efficiently

I have a large text body where I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"). But I struggle to find a proper (efficient) way to do this. For the later…
David
  • 9,216
  • 4
  • 45
  • 78
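One efficient pattern is to do the replacement at the token level with a named lookup vector, after tokenization but before the tokens go into text2vec, which avoids repeated regex passes over the raw text. A sketch:

```r
# Synonym map: word to replace -> canonical form
synonyms <- c(automobile = "car", auto = "car")

replace_synonyms <- function(tokens) {
  hit <- tokens %in% names(synonyms)
  tokens[hit] <- synonyms[tokens[hit]]
  tokens
}

toks <- c("the", "automobile", "was", "fast")
replace_synonyms(toks)  # "the" "car" "was" "fast"
```

This composes naturally with itoken by wrapping it in a custom tokenizer function.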
2
votes
1 answer

Preparing word embeddings in text2vec R package

Based on the text2vec package's vignette, an example is provided to create word embeddings. The wiki data is tokenized, and then a term co-occurrence matrix (TCM) is created, which is used to create the word embeddings using the glove function provided in the…
amitkb3
  • 303
  • 4
  • 14
1
vote
0 answers

How to calculate the coherence score for an LDA model?

I want to use coherence and perplexity to decide the best K (number of topics) in topic modeling. The sample of my dataset is: doc_id <- c(1:20) date <- c(1901:1920) text <- c("contribut theori microscop microscop percept", "illumin apparatus…
Emily
  • 27
  • 4
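text2vec itself ships a coherence() helper that scores a matrix of top terms per topic against a term co-occurrence matrix. A sketch, assuming a fitted `lda_model` and the `dtm` it was trained on already exist; argument names follow text2vec >= 0.5, so check ?coherence for your version:

```r
library(text2vec)

# Document-level term co-occurrence counts: in how many documents do
# two terms appear together
tcm <- Matrix::crossprod(sign(dtm))

# Top 10 terms per topic (one column per topic)
top_terms <- lda_model$get_top_words(n = 10, lambda = 1)

coherence(top_terms, tcm, n_doc_tcm = nrow(dtm))
```

coherence() returns several metrics (e.g. mean PMI variants) per topic; averaging a chosen metric over topics gives a single score per K.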
1
vote
1 answer

text2vec word embeddings: compound some tokens but not all

I am using {text2vec} word embeddings to build a dictionary of similar terms pertaining to a certain semantic category. Is it OK to compound some tokens in the corpus, but not all? For example, I want to calculate terms similar to “future…
scarlett rouge
  • 339
  • 2
  • 7
1
vote
1 answer

Export R text2vec Vectors for use in Gensim in Python

I have created GloVe vectors in R previously using the text2vec library. Is there any easy way to export these for use in Python, where I have scripts to compare/contrast with Gensim-created word vectors? I know there is a specific word2vec c_format,…
Jibril
  • 967
  • 2
  • 11
  • 29
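The word2vec text format is simple enough to write by hand: a "vocab_size dim" header line followed by one "word v1 v2 …" row per term. A sketch, assuming `word_vectors` is the matrix returned by the text2vec GloVe fit, with words as row names:

```r
# Write text2vec vectors in word2vec text format for Gensim
out <- file("vectors.txt", "w")
writeLines(paste(nrow(word_vectors), ncol(word_vectors)), out)  # header
write.table(word_vectors, out, col.names = FALSE, quote = FALSE)  # "word v1 v2 ..."
close(out)
```

On the Python side, gensim.models.KeyedVectors.load_word2vec_format("vectors.txt", binary=False) should then read the file.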
1
vote
1 answer

GloVe fit function issue with the text2vec package in R

I am new to GloVe word embeddings for NLP/deep learning models in R, but I find them very useful. I am experiencing problems implementing the model in R. When I use the correct constructor: glove <- GlobalVectors$new(word_vectors_szie = 50, vocabulary =…
nigus21
  • 337
  • 2
  • 11
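Note that the quoted constructor misspells the argument (`word_vectors_szie`), and the argument names have also changed across text2vec releases. A sketch of the constructor in recent versions, assuming a `tcm` built earlier in the pipeline:

```r
library(text2vec)

# text2vec >= 0.6: `rank` replaces the older `word_vectors_size`,
# and `vocabulary` is no longer a constructor argument
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10) + t(glove$components)
```

Passing a misspelled argument name to an R6 constructor typically fails with an "unused argument" error, so checking the spelling against ?GlobalVectors is the first thing to try.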
1
vote
1 answer

Elbow/knee in a curve in R

I've got this data processing: library(text2vec) ##Using perplexity for hold out set t1 <- Sys.time() perplex <- c() for (i in 3:25){ set.seed(17) lda_model2 <- LDA$new(n_topics = i) doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, …
MelaniaCB
  • 427
  • 5
  • 16
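Once a perplexity value has been collected for each candidate K, one simple heuristic for the elbow is the point of maximum curvature, approximated by the largest second difference of the curve. A sketch, assuming `perplex` holds the values computed for k = 3:25 as in the loop above:

```r
k <- 3:25

# Second difference approximates curvature of the perplexity curve;
# the largest value marks where the decrease levels off
d2 <- diff(diff(perplex))
elbow <- k[which.max(d2) + 1]
```

This is a heuristic, not a definitive criterion; plotting the curve and inspecting it alongside coherence scores is still advisable.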
1
vote
0 answers

Pre-initialize GloVe word vectors and biases using the initial parameter of text2vec's fit_transform

I would like to pre-initialise GloVe word vectors and biases using the initial parameter of fit_transform. The documentation of the function states to pass a named list of "w_i, w_j, b_i, b_j" values - initial word vectors and biases. As a…
Melt
  • 11
  • 1
1
vote
0 answers

Calculate aggregate cosine and Jaccard distance between two sets of documents

I collected a list of abstracts from online news websites and manually labelled them, by topic, using their original labels (e.g., politics, entertainment, sports, finance, etc.). Now I want to compare the similarity in word usage in abstracts…
Chris T.
  • 1,699
  • 7
  • 23
  • 45
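text2vec's sim2() handles both similarity measures on sparse document-term matrices, so the aggregate comparison reduces to building two DTMs over a shared vocabulary and averaging the pairwise similarities. A sketch, assuming `texts_a` and `texts_b` are character vectors of abstracts for two topic labels:

```r
library(text2vec)

# Shared vocabulary fitted on both sets, so the DTMs are comparable
it_all <- itoken(c(texts_a, texts_b), tokenizer = word_tokenizer,
                 progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it_all))

dtm_a <- create_dtm(itoken(texts_a, tokenizer = word_tokenizer), vectorizer)
dtm_b <- create_dtm(itoken(texts_b, tokenizer = word_tokenizer), vectorizer)

# Pairwise similarities; Jaccard expects binary matrices and no norm
cos_sim <- sim2(dtm_a, dtm_b, method = "cosine", norm = "l2")
jac_sim <- sim2(sign(dtm_a), sign(dtm_b), method = "jaccard", norm = "none")

mean(cos_sim)  # aggregate cosine similarity across all document pairs
mean(jac_sim)  # aggregate Jaccard similarity
```

Distances follow as 1 minus the corresponding similarity.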