Questions tagged [text2vec]

text2vec is an R package that provides a fast and memory-efficient framework for text mining applications in R: vectorization, word embeddings, topic modelling, and more.

The goal of text2vec is to provide tools to easily perform text mining in R at C++ speed:

  1. Core parts written in C++
  2. Small memory footprint
  3. Concise, pipe friendly API
  4. No need to load all data into RAM - process it in chunks
  5. Easy vertical scaling across multiple cores and threads

See the development page on GitHub.

111 questions
2
votes
0 answers

Error: attempt to apply non-function in text2vec

I am trying to replicate the example given in the following link https://cran.r-project.org/web/packages/text2vec/vignettes/glove.html. I have unzipped the file manually. I am getting the following error at this stage: library(text2vec) wiki =…
NinjaR
  • 621
  • 6
  • 22
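The vignette linked in this question predates several text2vec API changes, which is a common cause of "attempt to apply non-function" errors when old example code is run against a newer package version. A minimal sketch of the current GloVe pipeline, assuming text2vec >= 0.6 and a toy corpus in place of the wiki dump:

```r
library(text2vec)

# Hypothetical tiny corpus standing in for the wikipedia dump
texts <- c("the cat sat on the mat", "the dog sat on the log")

# Tokenize and build a vocabulary
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Term co-occurrence matrix with a symmetric window of 5 words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# Fit GloVe; recent text2vec versions use `rank` in the constructor
# (older releases used `word_vectors_size` and a `vocabulary` argument)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 10)
word_vectors <- wv_main + t(glove$components)  # main + context vectors
```

Consult the current vignette for your installed version, since the constructor arguments have been renamed across releases.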
2
votes
2 answers

Use a pre-trained model with text2vec?

I would like to use a pre-trained model with text2vec. My understanding is that the benefit here is that these models have already been trained on a huge volume of data, e.g. the Google News model. Reading the text2vec documentation, it looks like the…
Doug Fir
  • 19,971
  • 47
  • 169
  • 299
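text2vec does not ship pre-trained vectors, but any embedding distributed in the plain-text GloVe format (one "word v1 v2 …" row per line) can be loaded into an ordinary matrix and used with text2vec's similarity helpers. A sketch, with a hypothetical file name and dimensionality:

```r
library(text2vec)

# Load pre-trained GloVe vectors from a plain-text file (file name and
# the 50-dimension assumption are illustrative)
lines <- readLines("glove.6B.50d.txt")
parts <- strsplit(lines, " ", fixed = TRUE)
words <- vapply(parts, `[`, character(1), 1L)
vectors <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(50)))
rownames(vectors) <- words

# Cosine neighbours of a query word (assumes "king" is in the vocabulary)
sims <- sim2(vectors, vectors["king", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE))
```

Word2vec-format files have an extra "vocab_size dim" header line that would need to be skipped first.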
2
votes
4 answers

A lemmatizing function using a hash dictionary does not work with tm package in R

I would like to lemmatize Polish text using a large external dictionary (format like in the txt variable below). Unfortunately, Polish is not supported by the popular text mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by…
Jacek Kotowski
  • 620
  • 16
  • 49
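Dictionary-based lemmatization can be expressed as a named-vector lookup applied inside a custom tokenizer, which plugs directly into text2vec's itoken (and the same function can be wrapped in tm's content_transformer). A sketch with two illustrative Polish dictionary entries:

```r
library(text2vec)

# Hypothetical dictionary: inflected form -> lemma
lemmas <- c(kotami = "kot", kotom = "kot")

# Custom tokenizer that maps each token through the dictionary
lemma_tokenizer <- function(x) {
  lapply(word_tokenizer(x), function(tokens) {
    hit <- tokens %in% names(lemmas)
    tokens[hit] <- lemmas[tokens[hit]]
    tokens
  })
}

it <- itoken(c("kotami kotom pies"), tokenizer = lemma_tokenizer,
             progressbar = FALSE)
create_vocabulary(it)
```

For a dictionary with millions of entries, named-vector lookup in R uses hashing internally, so per-token cost stays roughly constant.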
2
votes
1 answer

H2O: how to use gradient boosting on textual data?

I am trying to implement a very simple ML learning problem, where I use text to predict some outcome. In R, some basic example would be: import some fake but funny text data library(caret) library(dplyr) library(text2vec) dataframe <- data_frame(id…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
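The usual bridge is to turn the text into a document-term matrix with text2vec and hand that, plus the outcome column, to H2O. A minimal sketch with made-up data (densifying the DTM, which is only reasonable for small vocabularies):

```r
library(text2vec)
library(h2o)

df <- data.frame(id = 1:4,
                 text = c("good movie", "bad movie", "great film", "awful film"),
                 label = c("pos", "neg", "pos", "neg"))

# Bag-of-words DTM via text2vec
it <- itoken(df$text, tokenizer = word_tokenizer, progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it))
dtm <- create_dtm(it, vectorizer)

# Hand the (densified) DTM plus label to H2O's gradient boosting
h2o.init()
hf <- as.h2o(data.frame(as.matrix(dtm), label = as.factor(df$label)))
model <- h2o.gbm(y = "label", training_frame = hf)
```

When x is omitted, h2o.gbm uses all remaining columns as predictors.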
2
votes
1 answer

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having…
user2047457
  • 381
  • 4
  • 13
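The key is to reuse the vectorizer fitted on the training corpus, so new documents map onto the same vocabulary, and then aggregate word vectors per document. A sketch, assuming `vectorizer` and `word_vectors` come from the original GloVe fit:

```r
library(text2vec)

new_docs <- c("new jargon heavy document")
it_new <- itoken(new_docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new <- create_dtm(it_new, vectorizer)

# Keep only terms that actually have an embedding, then take the mean
# word vector per document (rows with no known terms come out as NaN)
common <- intersect(colnames(dtm_new), rownames(word_vectors))
doc_vectors <- as.matrix(dtm_new[, common, drop = FALSE]) %*%
               word_vectors[common, , drop = FALSE] /
               Matrix::rowSums(dtm_new[, common, drop = FALSE])
```

Averaging is only one choice; tf-idf weighting of the rows before multiplying is a common refinement.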
2
votes
1 answer

Efficiently derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The…
user3554004
  • 1,044
  • 9
  • 24
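Since the n-gram files already arrive as aggregated counts, the matrix can be assembled directly from (word1, word2, count) triplets without ever materializing a dense matrix. A sketch using the Matrix package (the triplet data here is made up):

```r
library(Matrix)

# Hypothetical co-occurrence triplets parsed from the n-gram files
pairs <- data.frame(w1 = c("new", "new"),
                    w2 = c("york", "jersey"),
                    n  = c(100, 40))

terms <- sort(unique(c(pairs$w1, pairs$w2)))
tcm <- sparseMatrix(i = match(pairs$w1, terms),
                    j = match(pairs$w2, terms),
                    x = pairs$n,
                    dims = c(length(terms), length(terms)),
                    dimnames = list(terms, terms))
```

sparseMatrix sums duplicate (i, j) entries by default, so the triplets can be streamed in from multiple n-gram shards and combined.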
2
votes
3 answers

Replace words in text2vec efficiently

I have a large text body where I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"). But I struggle to find a proper (efficient) way to do this. For the later…
David
  • 9,216
  • 4
  • 45
  • 78
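One efficient pattern is to do the replacement at the token level with a named lookup vector, after tokenization but before the tokens go into text2vec, which avoids repeated regex passes over the raw text. A sketch:

```r
# Synonym map: word to replace -> canonical form
synonyms <- c(automobile = "car", auto = "car")

replace_synonyms <- function(tokens) {
  hit <- tokens %in% names(synonyms)
  tokens[hit] <- synonyms[tokens[hit]]
  tokens
}

toks <- c("the", "automobile", "was", "fast")
replace_synonyms(toks)  # "the" "car" "was" "fast"
```

This composes naturally with itoken by wrapping it in a custom tokenizer function.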
2
votes
1 answer

Preparing word embeddings in text2vec R package

Based on the text2vec package's vignette, an example is provided to create word embeddings. The wiki data is tokenized, and then a term co-occurrence matrix (TCM) is created, which is used to create the word embeddings using the glove function provided in the…
amitkb3
  • 303
  • 4
  • 14
1
vote
0 answers

How to calculate the coherence score for an LDA model?

I want to use coherence and perplexity to decide the best K (number of topics) in topic modeling. The sample of my dataset is: doc_id <- c(1:20) date <- c(1901:1920) text <- c("contribut theori microscop microscop percept", "illumin apparatus…
Emily
  • 27
  • 4
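text2vec itself ships a coherence() helper that scores a matrix of top terms per topic against a term co-occurrence matrix. A sketch, assuming a fitted `lda_model` and the `dtm` it was trained on already exist; argument names follow text2vec >= 0.5, so check ?coherence for your version:

```r
library(text2vec)

# Document-level term co-occurrence counts: in how many documents do
# two terms appear together
tcm <- Matrix::crossprod(sign(dtm))

# Top 10 terms per topic (one column per topic)
top_terms <- lda_model$get_top_words(n = 10, lambda = 1)

coherence(top_terms, tcm, n_doc_tcm = nrow(dtm))
```

coherence() returns several metrics (e.g. mean PMI variants) per topic; averaging a chosen metric over topics gives a single score per K.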
1
vote
1 answer

text2vec word embeddings: compound some tokens but not all

I am using {text2vec} word embeddings to build a dictionary of similar terms pertaining to a certain semantic category. Is it OK to compound some tokens in the corpus, but not all? For example, I want to calculate terms similar to “future…
scarlett rouge
  • 339
  • 2
  • 7
1
vote
1 answer

Export R text2vec Vectors for use in Gensim in Python

I have created GloVe vectors in R previously using the text2vec library. Is there any easy way to export these for use in Python, where I have scripts to compare/contrast with Gensim-created word vectors? I know there is a specific word2vec c_format,…
Jibril
  • 967
  • 2
  • 11
  • 29
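The word2vec text format is simple enough to write by hand: a "vocab_size dim" header line followed by one "word v1 v2 …" row per term. A sketch, assuming `word_vectors` is the matrix returned by the text2vec GloVe fit, with words as row names:

```r
# Write text2vec vectors in word2vec text format for Gensim
out <- file("vectors.txt", "w")
writeLines(paste(nrow(word_vectors), ncol(word_vectors)), out)  # header
write.table(word_vectors, out, col.names = FALSE, quote = FALSE)  # "word v1 v2 ..."
close(out)
```

On the Python side, gensim.models.KeyedVectors.load_word2vec_format("vectors.txt", binary=False) should then read the file.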
1
vote
1 answer

GloVe fit function issue with the text2vec package in R

I am new to GloVe word embeddings for NLP/deep learning models in R, but I find them very useful. I am experiencing problems implementing the model in R. When I use the correct constructor: glove <- GlobalVectors$new(word_vectors_szie = 50, vocabulary =…
nigus21
  • 337
  • 2
  • 11
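Note that the quoted constructor misspells the argument (`word_vectors_szie`), and the argument names have also changed across text2vec releases. A sketch of the constructor in recent versions, assuming a `tcm` built earlier in the pipeline:

```r
library(text2vec)

# text2vec >= 0.6: `rank` replaces the older `word_vectors_size`,
# and `vocabulary` is no longer a constructor argument
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10) + t(glove$components)
```

Passing a misspelled argument name to an R6 constructor typically fails with an "unused argument" error, so checking the spelling against ?GlobalVectors is the first thing to try.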
1
vote
1 answer

Elbow/knee in a curve in R

I've got this data processing: library(text2vec) ##Using perplexity for hold out set t1 <- Sys.time() perplex <- c() for (i in 3:25){ set.seed(17) lda_model2 <- LDA$new(n_topics = i) doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, …
MelaniaCB
  • 427
  • 5
  • 16
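Once a perplexity value has been collected for each candidate K, one simple heuristic for the elbow is the point of maximum curvature, approximated by the largest second difference of the curve. A sketch, assuming `perplex` holds the values computed for k = 3:25 as in the loop above:

```r
k <- 3:25

# Second difference approximates curvature of the perplexity curve;
# the largest value marks where the decrease levels off
d2 <- diff(diff(perplex))
elbow <- k[which.max(d2) + 1]
```

This is a heuristic, not a definitive criterion; plotting the curve and inspecting it alongside coherence scores is still advisable.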
1
vote
0 answers

Pre-initialize GloVe word vectors and biases using the initial parameter of text2vec's fit_transform

I would like to pre-initialise GloVe word vectors and biases using the initial parameter of fit_transform. The documentation of the function states to pass a named list of "w_i, w_j, b_i, b_j" values - initial word vectors and biases. As a…
Melt
  • 11
  • 1
1
vote
0 answers

Calculate aggregate cosine and Jaccard distance between two sets of documents

I collected a list of abstracts from online news websites and manually labelled them, by topic, using their original labels (e.g., politics, entertainment, sports, finance, etc.). Now I want to compare the similarity in word usage in abstracts…
Chris T.
  • 1,699
  • 7
  • 23
  • 45
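text2vec's sim2() handles both similarity measures on sparse document-term matrices, so the aggregate comparison reduces to building two DTMs over a shared vocabulary and averaging the pairwise similarities. A sketch, assuming `texts_a` and `texts_b` are character vectors of abstracts for two topic labels:

```r
library(text2vec)

# Shared vocabulary fitted on both sets, so the DTMs are comparable
it_all <- itoken(c(texts_a, texts_b), tokenizer = word_tokenizer,
                 progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it_all))

dtm_a <- create_dtm(itoken(texts_a, tokenizer = word_tokenizer), vectorizer)
dtm_b <- create_dtm(itoken(texts_b, tokenizer = word_tokenizer), vectorizer)

# Pairwise similarities; Jaccard expects binary matrices and no norm
cos_sim <- sim2(dtm_a, dtm_b, method = "cosine", norm = "l2")
jac_sim <- sim2(sign(dtm_a), sign(dtm_b), method = "jaccard", norm = "none")

mean(cos_sim)  # aggregate cosine similarity across all document pairs
mean(jac_sim)  # aggregate Jaccard similarity
```

Distances follow as 1 minus the corresponding similarity.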