Questions tagged [text2vec]

text2vec is an R package that provides a fast, memory-efficient framework for text mining: vectorization, word embeddings, topic modelling and more.

The goal of text2vec is to provide tools to easily perform text mining in R with C++ speeds:

  1. Core parts written in C++
  2. Small memory footprint
  3. Concise, pipe-friendly API
  4. No need to load all data into RAM; process it in chunks
  5. Easy vertical scaling across multiple cores and threads

See the development page on GitHub.

111 questions
1 vote, 1 answer

Join doc_topic_distr with DTM raw data using doc_id

I want to try some kind of prediction stuff similar to this one: https://www.quora.com/How-do-I-use-LDA-Latent-Dirichlet-Allocation-for-document-classification-preferably-with-solutions-that-can-be-implemented-in-R I think that I will have to merge…
Flocke Haus
1 vote, 1 answer

Why do fit_transform and transform produce different results?

I was playing around with LDA in the text2vec package and was confused why the results of fit_transform and transform were different when using the same data. The documentation states that transform applies the learned model to new data but the result is a lot…
1 vote, 1 answer

Read GloVe pre-trained embeddings into R, as a matrix

Working in R. I know the pre-trained GloVe embeddings (e.g., "glove.6B.50d.txt") can be found here: https://nlp.stanford.edu/projects/glove/. However, I've had zero luck reading this text file into R so that the product is the word embedding matrix…
Drew
1 vote, 2 answers

Why is LSA in text2vec producing different results every time?

I was using latent semantic analysis in the text2vec package to generate word vectors and using transform to fit new data when I noticed something odd: the spaces do not line up when trained on the same data. There appears to be some…
user3554004
1 vote, 0 answers

Relaxed Word Mover's Distance in R

I am using Relaxed Word Mover's Distance in the package text2vec to compute the distance between documents, so as to identify the most similar document for each target document. Word vectors are compiled using FastText available in the package…
TMC
1 vote, 1 answer

Using GloVe's pretrained glove.6B.50.txt as a basis for word embeddings in R

I'm trying to convert textual data into vectors using GloVe in R. My plan was to average the word vectors of a sentence, but I can't seem to get to the word vectorization stage. I've downloaded the glove.6b.50.txt file and its parent zip file from:…
Travasaurus
1 vote, 1 answer

How to represent each word occurrence as a separate tcm vector in R?

I am looking for an efficient way to create a term co-occurrence matrix for (each) target word in a corpus, such that each occurrence of the word would constitute its own vector (row) in a tcm, where the columns are the context words (i.e., a…
user3554004
1 vote, 1 answer

LDA topic model using R text2vec package and LDAvis in shinyApp

Here is the code for LDA topic modelling with R text2vec package: library(text2vec) tokens = docs$text %>% # docs$text: a collection of text documents word_tokenizer it = itoken(tokens, ids = docs$id, progressbar = FALSE) v =…
Sam S.
1 vote, 2 answers

R function with reference to argument without evaluating it

islands1<-islands #a named num (vector) data.frame(island_col=names(islands1), number_col=islands1,row.names=NULL) This creates a dataframe consisting of two columns, the first contains the names from the named vector and is called "island_col", the…
Will Hauser
1 vote, 1 answer

How to train a lasso with both text and numeric variables?

Consider this modified classic example: library(dplyr) library(tibble) dtrain <- data_frame(text = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai", "France", …
ℕʘʘḆḽḘ
1 vote, 0 answers

How to use a built classifier (based on word embeddings) on new data for sentiment analysis?

So I used the text2vec R package to build word vectorizations for feature selection. I did that according to Dmitriy Selivanov's page http://text2vec.org/vectorization.html, which explains how to properly use text2vec before building a…
Lucinho91
1 vote, 0 answers

How to create svm plot with document term matrix from text2vec package in R?

I'm using the text2vec package to create a vocabulary document term matrix as described here: http://text2vec.org/vectorization.html#vectorization In particular, I am using SVM from the e1071 package. I made a similar vocabulary term document matrix…
Kwiebes
1 vote, 0 answers

Get word vectors for each document

I stumbled upon the text2vec package, which implements word embeddings in R. I have been experimenting with it successfully. However, I have been trying to implement word vectors onto each document exactly like I found in H2O (Python) here…
Shoaibkhanz
1 vote, 2 answers

How do I include stopwords (terms) in text2vec?

In the text2vec package, I am using the create_vocabulary function. For example: my text is "This book is very good" and suppose I am not using stopwords and an ngram of 1L to 3L, so the vocab terms will be This, book, is, very, good, This book,..... book is…
tej kiran
1 vote, 1 answer

ngrams using hash_vectorizer in text2vec

I was trying to create ngrams using hash_vectorizer function in text2vec, when I noticed that it doesn't change the dimensions of my dtm with changing values. h_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L)) dtm_train =…
Akhil